Research Paper

Research on Literature Classification Methods Enhanced with Structural Information

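  • 1. Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081;
    2. Institute of Software, Chinese Academy of Sciences, Beijing 100190
AN Bo (1986- ), male, Ph.D., associate research fellow. Research interests: natural language processing and knowledge graphs. E-mail: anbo@cass.org.cn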

Received: 2023-02-08

  Published online: 2023-04-04

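Funding

National Natural Science Foundation of China project "Research on Key Technologies for Knowledge-Enhanced Chinese Paraphrase Identification" (62076233); National Social Science Fund of China project "Research on the Construction of a Tibetan-Chinese Bilingual Knowledge Graph of Tibetan Ancient Books" (22BTQ010)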

Literature Classification Methods based on Structural Information Enhancement

  • 1. Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081;
    2. Institute of Software, Chinese Academy of Sciences, Beijing 100190

Received date: 2023-02-08

  Published online: 2023-04-04

Summary

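[Purpose/Significance] Traditional literature classification methods do not make full use of the structural information among documents. To address this, this paper proposes building a keyword-literature graph network to capture the structural information among documents and using it to enhance traditional content-based classification methods. [Methods/Process] A graph convolutional neural network models the keyword-literature graph data and learns a node representation for each document in the graph, while BERT+BiLSTM learns the content representation of each document. The node representation and the content representation are then concatenated into a single representation that fuses structural information with textual semantics, and literature classification is performed on this fused representation. [Results/Conclusions] The experimental results show that structural information improves literature classification performance, but that structural information alone does not classify literature well. Error analysis shows that the model tends to misclassify literature involving emerging interdisciplinary fields and new concepts, which indicates a limitation in handling such data and a direction for future optimization.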
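The graph-based representation and the concatenation step can be illustrated with a small sketch. It assumes the widely used symmetric-normalized graph convolution, ReLU(D^{-1/2}(A+I)D^{-1/2} H W); the summary does not state which GCN variant, feature initialization, or dimensions the paper uses, so the toy graph, the feature values, the identity weight matrix, and the `content_repr` vectors standing in for the BERT+BiLSTM output are all hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense 0/1 adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(norm_adj, features, weight):
    """One propagation step: ReLU(A_norm @ H @ W)."""
    hidden = matmul(matmul(norm_adj, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Toy graph (assumption): nodes 0 and 1 are documents,
# node 2 is a keyword that both documents contain.
adj = [[0, 0, 1],
       [0, 0, 1],
       [1, 1, 0]]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical initial features
weight = [[1.0, 0.0], [0.0, 1.0]]                # identity in place of learned weights

node_repr = gcn_layer(normalize_adjacency(adj), features, weight)

# Hypothetical content vectors standing in for the BERT+BiLSTM output.
content_repr = {0: [0.2, 0.7, 0.1], 1: [0.9, 0.1, 0.3]}

# Fusion by concatenation: [structural representation ; content representation].
fused = {doc: node_repr[doc] + content_repr[doc] for doc in content_repr}
```

After one propagation step, document 0 has a non-zero value in its second feature dimension even though it started at zero there: the signal arrived through the shared keyword node, which is the kind of cross-document structure the fused vector is meant to carry into the classifier.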

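Cite this article

AN Bo. Literature Classification Methods based on Structural Information Enhancement[J]. Journal of Library and Information Science in Agriculture, 2023, 35(3): 15-24. DOI: 10.13998/j.cnki.issn1002-1248.23-0059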

Abstract

[Purpose/Significance] Literature classification is a fundamental task in library and information services and is of great value for information resource management and for literature retrieval and acquisition. Deep learning-based methods are the current mainstream in text classification; they use neural networks to model the textual content of a document and classify it on that basis. This approach uses only the information in the document itself and ignores the associations between documents. By observing the data, we found that documents in the same category tend to share more keywords, so documents can be linked through keywords into an association network that captures structural relationships among them. We attempt to use this structural information to improve the performance of literature classification. [Methods/Process] This paper proposes a method that models a structural representation of the literature and uses it to enhance traditional literature classification methods. Specifically, we first constructed a large-scale keyword dictionary from about 930,000 collected documents. Second, we extracted a keyword set from the title and abstract of each paper with a bidirectional maximum matching algorithm and constructed a keyword-literature graph, with documents and keywords as nodes and the inclusion relationship between a document and a keyword as an edge, so that documents are connected to one another through shared keywords. We then employed a graph convolutional neural network to model this graph and learn representations of the documents and keywords in it; the resulting document representations encode the structural relationships between documents. In addition, we employed BERT+BiLSTM to model the textual content representation of each document. Finally, the structural and textual representations were concatenated, and classification was performed on the concatenated representation. [Results/Conclusions] We constructed a literature classification dataset containing 423 classes and split it into training, validation, and test sets at a ratio of 8:1:1, then conducted classification experiments on it. The experimental results show that the structural information of literature can effectively enhance the performance of traditional literature classification methods. The ablation experiments also show that structural information alone is insufficient for the literature classification task. Through detailed error analysis, we found that the model still has problems handling some less frequent keywords and concepts. In the future, we plan to use few-shot learning methods to address classification for literature categories with little data.
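The keyword extraction and graph construction steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the tie-breaking rule of its bidirectional maximum matching, so the sketch assumes the common heuristic of keeping the segmentation with fewer tokens and preferring the backward pass on ties. The four-word `vocab`, the two sample titles, and all function names are hypothetical.

```python
from collections import defaultdict

def forward_max_match(text, vocab, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens

def backward_max_match(text, vocab, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    return tokens[::-1]

def extract_keywords(text, vocab):
    """Keep the segmentation with fewer tokens (ties go to the backward
    pass, an assumed heuristic), then keep only dictionary words."""
    max_len = max(map(len, vocab))
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)
    best = fwd if len(fwd) < len(bwd) else bwd
    return {tok for tok in best if tok in vocab}

def build_graph(docs, vocab):
    """Undirected adjacency map over document nodes and keyword nodes;
    an edge means the document contains the keyword."""
    graph = defaultdict(set)
    for doc_id, text in docs.items():
        for kw in extract_keywords(text, vocab):
            graph[("doc", doc_id)].add(("kw", kw))
            graph[("kw", kw)].add(("doc", doc_id))
    return graph

# Hypothetical toy dictionary and document titles.
vocab = {"图神经网络", "神经网络", "文献分类", "文本分类"}
docs = {
    "d1": "基于图神经网络的文献分类",
    "d2": "图神经网络与文本分类",
}
graph = build_graph(docs, vocab)
```

In this toy graph the keyword node `图神经网络` is adjacent to both `d1` and `d2`, so the two documents are two hops apart; a graph convolutional network run over such a graph can pass information between documents that share keywords.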
