A Classification Method of Agricultural News Text Based on BERT and Deep Active Learning

doi:10.13998/j.cnki.issn1002-1248.22-0172

Abstract

[Purpose/Significance] Most models used in news classification research are trained without active learning. These models share common problems: data cannot be labeled promptly and labeling costs are too high, which also hinders the analysis of agricultural news. With the explosive growth of news data in the internet era, it has become harder to label data, train supervised text classification models, and screen agriculture-related news from diverse online news sources. To address this problem, the most commonly used pool-based active learning and deep active learning techniques are applied to select the more valuable and representative samples from unlabeled data for manual labeling, and to build labeled datasets that improve the efficiency and effectiveness of news classification and agricultural news mining.

[Method/Process] Machine learning models commonly used for text classification, such as the random forest, multinomial naive Bayes and logistic regression classifiers, were combined with the least-confidence active learning method to analyze the effect. The BERT model was combined with three sampling strategies (discriminative active learning, deep Bayesian active learning and least confidence) for deep active learning training. The experiments used a news corpus of 19 847 samples crawled from Sina and other news websites and then cleaned, with the goal of screening agriculture-related news from news samples on diverse topics. Iterative experiments adding 30 samples per round were run to examine how the F₁ score of each method combination improved as the number of annotations increased. In addition, the representativeness and diversity of the samples selected by each sampling function in BERT-based deep active learning were compared, to understand the characteristics of each strategy and to inform the selection and improvement of AL strategies in the future. The paper also analyzed how much labeling cost the proposed method can save.

[Results/Conclusions] A comparison of several machine learning models shows that although the gradient boosting tree and support vector machine classifiers achieve high accuracy, they are poorly suited to active learning because they process large-scale, high-dimensional text data inefficiently. After the other machine learning models and the BERT model were trained with the corresponding active learning or deep active learning methods, active learning was found to significantly improve the training process of each model. Among them, the BERT model combined with the discriminative active learning sampling function achieves the best news text classification performance and requires the least annotated data. The samples selected by the discriminative active learning sampling function are also the most representative and diverse, which explains the source of this method's advantage. It was also found that, for the same task and model, the higher the required classification accuracy, the more annotation cost active learning saves compared with non-active learning.
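The pool-based loop with least-confidence sampling described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the `fit` / `predict_proba` / `oracle` callables, and the random choice of the initial labeled batch are all assumptions made for the example. Only the batch size of 30 samples per round comes from the abstract.

```python
import random


def least_confidence(probs):
    """Uncertainty of one prediction: 1 minus the highest class probability."""
    return 1.0 - max(probs)


def select_batch(pool_probs, batch_size=30):
    """Indices of the pool items the model is least confident about."""
    order = sorted(
        range(len(pool_probs)),
        key=lambda i: least_confidence(pool_probs[i]),
        reverse=True,
    )
    return order[:batch_size]


def active_learning(pool, oracle, fit, predict_proba, rounds,
                    batch_size=30, seed=0):
    """Pool-based active learning with least-confidence sampling.

    All callables are hypothetical placeholders:
      oracle(x)               -> label given by the human annotator
      fit(labeled)            -> model trained on [(x, label), ...]
      predict_proba(model, x) -> list of class probabilities for x
    """
    rng = random.Random(seed)
    unlabeled = list(pool)
    rng.shuffle(unlabeled)

    # Assumption: the first batch is drawn at random to bootstrap the model.
    labeled = [(x, oracle(x)) for x in unlabeled[:batch_size]]
    unlabeled = unlabeled[batch_size:]
    model = fit(labeled)

    for _ in range(rounds):
        if not unlabeled:
            break
        probs = [predict_proba(model, x) for x in unlabeled]
        picked = set(select_batch(probs, batch_size))
        # Send the most uncertain items to the annotator, then retrain.
        labeled += [(unlabeled[i], oracle(unlabeled[i])) for i in sorted(picked)]
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in picked]
        model = fit(labeled)

    return model, labeled
```

In practice `fit` and `predict_proba` would wrap whichever classifier is being studied (for instance a logistic regression or a fine-tuned BERT model), and the F₁ score would be measured on a held-out set after each round to trace how it grows with the number of annotations.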

Key words: deep learning, agricultural news, text classification, BERT model, active learning

CLC Number: TP391.1

SHI Yunlai, CUI Yunpeng, DU Zhigang. A Classification Method of Agricultural News Text Based on BERT and Deep Active Learning[J]. Journal of Library and Information Science in Agriculture, 2022, 34(8): 19-29.
