Journal of Library and Information Science in Agriculture ›› 2023, Vol. 35 ›› Issue (3): 15-24. doi: 10.13998/j.cnki.issn1002-1248.23-0059

• Research Paper •

Literature Classification Methods based on Structural Information Enhancement

AN Bo1,2

  1. Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081;
  2. Institute of Software, Chinese Academy of Sciences, Beijing 100190
  • Received: 2023-02-08 Online: 2023-03-05 Published: 2023-05-31
  • About the author: AN Bo (1986- ), male, PhD, associate research fellow; research interests: natural language processing and knowledge graphs. E-mail: anbo@cass.org.cn
  • Funding:
    National Natural Science Foundation of China project "Research on Key Technologies of Knowledge-Enhanced Chinese Paraphrase Identification" (62076233); National Social Science Fund of China project "Research on Constructing a Tibetan-Chinese Bilingual Knowledge Graph of Tibetan Ancient Books" (22BTQ010)

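Abstract (translated from the Chinese version): [Purpose/Significance] Traditional literature classification methods do not make full use of the structural information of literature. To address this, this paper proposes building structural information between documents with a keyword-literature graph and using it to enhance traditional content-based classification methods. [Methods/Process] We use a graph convolutional neural network to model the keyword-literature graph and learn a node representation for each document in the graph, and we use BERT+BiLSTM to learn the content representation of each document. We then concatenate the node representation and the content representation to obtain a representation that fuses structural information and textual semantics, and classify documents based on it. [Results/Conclusions] The experimental results show that structural information improves literature classification performance, but structural information alone does not classify literature well. Error analysis shows that the model tends to misclassify documents involving emerging interdisciplinary fields and new concepts, which indicates that the model still has limitations on this kind of data and points to a direction for future optimization.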
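The fusion step described above can be illustrated with a minimal sketch. It assumes the widely used symmetric-normalized graph convolution, ReLU(D^-1/2 (A + I) D^-1/2 H W); the abstract does not state which variant the paper uses. The function names (`normalized_adjacency`, `gcn_layer`, `fuse`), the toy graph, and the placeholder content vector standing in for a BERT+BiLSTM output are all hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(n, edges):
    """Build D^-1/2 (A + I) D^-1/2 for an undirected graph with n nodes."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # self-loops
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(a_hat, h, w):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

def fuse(node_vec, content_vec):
    """Concatenate the structural (node) and textual (content) representations."""
    return list(node_vec) + list(content_vec)

# Toy graph (assumed): nodes 0 and 1 are documents, node 2 is a shared keyword.
edges = [(0, 2), (1, 2)]
a_hat = normalized_adjacency(3, edges)
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot node features
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # illustrative, not learned
node_repr = gcn_layer(a_hat, features, weights)

content_vec = [0.1, 0.2, 0.3]  # placeholder for a BERT+BiLSTM sentence vector
fused = fuse(node_repr[0], content_vec)
```

In this toy graph the two documents have no direct edge, yet both receive a contribution from the shared keyword node, which is the kind of structural signal the abstract attributes to the keyword-literature graph.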

Abstract: [Purpose/Significance] Literature classification is a fundamental task in library and information services and is of great value for information resource management and for literature retrieval and acquisition. Deep learning-based methods, the current mainstream in text classification, use neural networks to model the textual content of a document and classify it. This approach uses only the information in the document itself and ignores the associations between documents. By observing the data, we found that documents in the same category tend to share more keywords. Keywords can therefore link documents into an association network that captures structural relationships between them. We attempt to use this structural information to improve literature classification performance. [Methods/Process] This paper proposes a method that models a structural representation of each document and uses it to enhance traditional literature classification methods. Specifically, we first constructed a large-scale keyword dictionary from about 930,000 collected documents. Second, we extracted a keyword set from the title and abstract of each paper with a bidirectional maximum matching algorithm and constructed a keyword-literature graph, with documents and keywords as nodes and the containment relationship between a document and a keyword as edges, so that documents are connected to each other through shared keywords. Third, we employed a graph convolutional neural network to model this graph and learn representations of documents and keywords; the resulting document representations encode the structural relationships between documents. In addition, we employed BERT+BiLSTM to model the textual content representation of each document. Finally, the structural and textual representations were concatenated, and classification was performed on the concatenated representation. [Results/Conclusions] We constructed a literature classification dataset containing 423 classes and split it into training, validation, and test sets at a ratio of 8:1:1. Experiments on this dataset show that the structural information of literature effectively improves the performance of traditional literature classification methods. The ablation experiments also show that structural information alone is insufficient for the task. Through detailed error analysis, we found that the model still has problems handling less frequent keywords and concepts. In the future, we plan to use few-shot learning methods to address classification for categories with little data.
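The keyword extraction and graph construction steps can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tie-breaking heuristic (fewer tokens, then fewer single-character tokens, then the backward result) is a common convention assumed here, and the function names, the toy dictionary, and the document identifiers are hypothetical.

```python
def forward_max_match(text, vocab, max_len):
    """Scan left to right, taking the longest dictionary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:  # fall back to a single character
                tokens.append(piece)
                i += size
                break
    return tokens

def backward_max_match(text, vocab, max_len):
    """Scan right to left, taking the longest dictionary entry ending at each position."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens

def extract_keywords(text, vocab):
    """Run both passes, keep the better segmentation, and return its dictionary hits."""
    max_len = max((len(w) for w in vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(tokens):
        # Assumed heuristic: fewer tokens first, then fewer single characters.
        return (len(tokens), sum(len(t) == 1 for t in tokens))

    best = fwd if cost(fwd) < cost(bwd) else bwd
    return {t for t in best if t in vocab}

def build_edges(docs, vocab):
    """Return (document id, keyword) pairs: the edges of the keyword-literature graph."""
    edges = []
    for doc_id, text in docs.items():
        for keyword in sorted(extract_keywords(text, vocab)):
            edges.append((doc_id, keyword))
    return edges

# Toy data (assumed): two documents that share the keyword "文献分类".
vocab = {"图卷积神经网络", "神经网络", "文献分类", "知识图谱"}
docs = {
    "doc1": "基于图卷积神经网络的文献分类",
    "doc2": "知识图谱与文献分类",
}
edges = build_edges(docs, vocab)
```

Two documents become connected in the graph whenever they have an edge to the same keyword node, as `doc1` and `doc2` do here through "文献分类".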

Key words: literature classification, graph convolutional network, keyword-literature graph, semantic association, knowledge organization, natural language processing

CLC Number:

  • TP393

Cite this article

AN Bo. Literature Classification Methods based on Structural Information Enhancement[J]. Journal of Library and Information Science in Agriculture, 2023, 35(3): 15-24.