Journal of Library and Information Science in Agriculture ›› 2022, Vol. 34 ›› Issue (8): 19-29. doi: 10.13998/j.cnki.issn1002-1248.22-0172

• Special Topic on Agricultural Deep Learning •

A Classification Method of Agricultural News Text Based on BERT and Deep Active Learning

SHI Yunlai1, CUI Yunpeng1,*, DU Zhigang2

  1. Agricultural Information Institute of CAAS, Beijing 100081;
  2. Zibo Digital Agricultural Rural Development Center, Zibo 255000
  • Received: 2022-03-18 Online: 2022-08-05 Published: 2022-10-26
  • Corresponding author: *CUI Yunpeng (1972- ), male, research fellow and doctoral supervisor; research interests: agricultural information technology, agricultural knowledge management, and data mining. Email: cuiyunpeng@caas.cn
  • About the authors: SHI Yunlai (1996- ), male, master's degree; research interests: natural language processing and active learning. DU Zhigang (1979- ), male, master's degree, senior engineer; research interests: agricultural and rural informatization, big data analysis, and modeling.
  • Funding:
    Special Literature Task of the National Science and Technology Library (NSTL) (2021XM45)

Abstract: [Purpose/Significance] Most models in current research on agricultural news classification are trained by passive learning. This approach commonly suffers from data that cannot be labeled promptly and from excessive labeling cost, which also hinders the analysis of agricultural news. The explosive growth of news data in the network era makes it even harder to label data, train supervised text classification models, and screen agriculture-related news from diverse online news sources. To address this problem, pool-based active learning, the most commonly used form, and deep active learning are used to select the more valuable and representative samples from unlabeled data for manual labeling, and to build labeled datasets that improve the efficiency and effectiveness of news classification and agricultural news mining.

[Method/Process] Machine learning models commonly used for text classification, namely random forest, multinomial naive Bayes, and logistic regression classifiers, were combined with least-confidence active learning to analyze the improvement. The BERT model was combined with three sampling strategies (discriminative active learning, deep Bayesian active learning, and least confidence) for deep active learning training. The experiments used a news corpus of 19,847 samples crawled from Sina and other news websites and then cleaned, with the goal of screening agriculture-related news from samples covering a variety of topics. Iterative experiments that added 30 newly labeled samples per round tracked how the F1 score of each method combination improved as the number of annotations grew. In addition, the representativeness and diversity of the samples selected by each sampling function in BERT-based deep active learning were compared, to characterize each strategy and to inform the future selection and improvement of AL strategies. The paper also analyzed how much labeling cost the proposed method saves.

[Results/Conclusions] A comparison of several machine learning models shows that gradient boosting trees and support vector machine classifiers, despite their high accuracy, are unsuitable for active learning because they process large-scale, high-dimensional text data inefficiently. Training the other machine learning models and the BERT model with the corresponding active learning or deep active learning methods shows that active learning clearly improves the training process of every model. Among them, the BERT model combined with the discriminative active learning sampling function achieves the best news text classification performance with the lowest annotation requirement. The samples selected by the discriminative active learning sampling function are also the most representative and diverse, which explains the source of this method's advantage. Furthermore, for the same task and model, the higher the required classification accuracy, the more annotation cost active learning saves compared with passive learning.

Key words: deep learning, agricultural news, text classification, BERT model, active learning
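The pool-based loop with least-confidence sampling described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. All function names are hypothetical, and a toy one-dimensional threshold model stands in for the classifiers used in the paper (random forest, multinomial naive Bayes, logistic regression, BERT). The only detail taken from the abstract is the batch size of 30 newly labeled samples per round.

```python
import math
import random


def least_confidence(probs):
    """Least-confidence score: 1 minus the highest class probability."""
    return 1.0 - max(probs)


def select_least_confident(pool_probs, batch_size=30):
    """Return the pool indices whose predictions are least confident.

    pool_probs maps a pool index to its predicted class probabilities.
    """
    ranked = sorted(pool_probs,
                    key=lambda i: least_confidence(pool_probs[i]),
                    reverse=True)
    return ranked[:batch_size]


def active_learning_loop(fit, predict_proba, oracle, pool, labeled,
                         rounds, batch_size=30):
    """Pool-based loop: fit, score the unlabeled pool, query labels, repeat."""
    labeled = dict(labeled)  # pool index -> label collected so far
    for _ in range(rounds):
        model = fit([(pool[i], y) for i, y in labeled.items()])
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        probs = {i: predict_proba(model, pool[i]) for i in unlabeled}
        for i in select_least_confident(probs, batch_size):
            labeled[i] = oracle(i)  # stands in for manual annotation
    return labeled


# --- Toy stand-in model (an assumption for illustration only) ---
def fit_threshold(examples):
    """Place the decision threshold midway between the two class means."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def predict_proba_threshold(threshold, x):
    """Return [P(class 0), P(class 1)] from a sigmoid around the threshold."""
    p1 = 1.0 / (1.0 + math.exp(-(x - threshold)))
    return [1.0 - p1, p1]


random.seed(0)
pool = [random.uniform(-5, 5) for _ in range(500)]
truth = [1 if x > 0.5 else 0 for x in pool]  # hidden labels the oracle reveals
seed = {pool.index(min(pool)): 0, pool.index(max(pool)): 1}
labeled = active_learning_loop(fit_threshold, predict_proba_threshold,
                               lambda i: truth[i], pool, seed, rounds=3)
```

In this toy setup each round queries the 30 points closest to the current threshold, since those have top-class probabilities nearest 0.5. A real text classifier would replace `fit_threshold` and `predict_proba_threshold` with model training and probability prediction over document features, and the other sampling strategies named in the abstract would replace `select_least_confident`.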

CLC number:

  • TP391.1

Cite this article

SHI Yunlai, CUI Yunpeng, DU Zhigang. A Classification Method of Agricultural News Text Based on BERT and Deep Active Learning[J]. Journal of Library and Information Science in Agriculture, 2022, 34(8): 19-29.