Journal of Library and Information Science in Agriculture ›› 2023, Vol. 35 ›› Issue (1): 87-98. doi: 10.13998/j.cnki.issn1002-1248.22-0662

• Research Paper •

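Selection of Papers on the Origins of COVID-19 and Entity Annotation Based on Full Texts

XU Shuo1, ZHANG Mengmeng2, LIU Liyuan2, WANG Congcong1, SUN Rui2, LI Yilin2, XU Jinnan2, AN Xin2,*

  1. School of Economics and Management, Beijing University of Technology, Beijing 100124;
  2. School of Economics and Management, Beijing Forestry University, Beijing 100083
  • Received: 2022-09-07 Online: 2023-01-05 Published: 2023-03-23
  • Corresponding author: *AN Xin (1980- ), female, PhD, professor, master's supervisor; research interests: scientometrics and data mining. Email: anxin@bjfu.edu.cn
  • About the authors: XU Shuo (1979- ), male, PhD, professor, doctoral supervisor; research interests: scientometrics, scientific and technical intelligence analysis, and data mining. ZHANG Mengmeng (1997- ), female, master's degree; research interest: scientific and technical intelligence analysis. LIU Liyuan (1998- ), female, master's degree; research interest: knowledge diffusion. WANG Congcong (1999- ), female, master's degree; research interest: scientometrics. SUN Rui (1996- ), female, master's student; research interest: knowledge diffusion. LI Yilin (1998- ), female, master's degree; research interest: scientometrics. XU Jinnan (1998- ), female, master's degree; research interest: scientometrics.
  • Funding:
    National Natural Science Foundation of China project "Research on the Diffusion Mechanism of Micro-level Entities Based on Full Text" (72004012); Beijing University of Technology 2022 project "Graduate Ideological and Political Education in Research Teams: Special Exploratory Project on Epidemic Response"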

Abstract: [Purpose/Significance] Since the outbreak of COVID-19, the number of studies on the disease has grown rapidly at home and abroad. Organizing this literature provides data resources for research on the emergence and transmission mechanisms of SARS-CoV-2. However, existing COVID-19 datasets are general collections of literature that are not classified by subfield, and coarse-grained information such as titles and authors cannot support an in-depth understanding of research progress. This paper therefore created a dataset for one COVID-19 subfield and a fine-grained entity dataset. [Method/Process] First, this paper proposed a literature screening method based on an active learning model, which obtains more valuable labeled samples at lower labor cost so that the classifier generalizes better. Three base classifiers were considered: Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF), together with four query strategies: uncertainty sampling, expected error reduction, query by committee, and random sampling. Taking the origin of SARS-CoV-2 as an example subfield, articles related to the origin of SARS-CoV-2 were located efficiently and accurately in the literature. This paper also designed an annotation scheme covering 18 entity types, including not only entities common across the biological domain, such as genes, proteins, and compounds, but also entities specific to the COVID-19 domain, such as coronaviruses and wild animals. The visual annotation tool BRAT was used for entity annotation. The annotation team consisted of one administrator and six annotators, and annotation proceeded in two rounds. In addition, the multi-κ consistency index was used to compute agreement scores for the annotation results.
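The uncertainty sampling strategy named in the abstract can be illustrated with a minimal sketch. It assumes a base classifier (SVM, LR, or RF) has already produced, for each unlabeled paper, a probability of being origin-related. The paper identifiers, probabilities, and function names are hypothetical and are not taken from the study. The sketch ranks the unlabeled pool by binary entropy and returns the most ambiguous papers for human labeling.

```python
import math


def binary_entropy(p):
    """Shannon entropy (in bits) of a yes/no prediction with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully confident prediction carries no uncertainty
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def uncertainty_sampling(pool_probs, batch_size):
    """Pick the `batch_size` unlabeled papers the classifier is least sure about.

    pool_probs maps a paper id to the predicted probability that the paper
    is origin-related (hypothetical input; any probabilistic classifier works).
    """
    ranked = sorted(
        pool_probs,
        key=lambda paper_id: binary_entropy(pool_probs[paper_id]),
        reverse=True,  # highest entropy (most uncertain) first
    )
    return ranked[:batch_size]


# Hypothetical classifier outputs for four unlabeled papers.
pool_probs = {"paper_a": 0.97, "paper_b": 0.52, "paper_c": 0.08, "paper_d": 0.35}
queried = uncertainty_sampling(pool_probs, batch_size=2)
print(queried)  # papers to send to a human annotator next
```

In a full active learning loop, the queried papers would be labeled, added to the training set, and the classifier retrained before the next query round. Random sampling, the baseline strategy, would draw the batch from the pool without looking at the probabilities.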
[Results/Conclusions] The active learning results show that the uncertainty sampling query strategy performs best. With uncertainty sampling, SVM, LR, and RF correctly screened 425, 465, and 489 articles, respectively. After overlapping articles were removed, a dataset on the origin of SARS-CoV-2 was constructed, containing 885 articles in total. Based on the proposed entity annotation scheme, the six annotators annotated 99 full-text papers. From these fine-grained annotations, this paper constructed an entity dataset containing 39,118 entities, which is currently the largest and most comprehensive entity-annotated dataset in the COVID-19 field.
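The overlap-removal step can be read as a set union over the articles selected by each classifier. The three reported counts sum to 425 + 465 + 489 = 1,379, so a final dataset of 885 articles means 494 of those selections repeat an article already picked by another classifier. The sketch below shows the same bookkeeping on toy data. The identifiers and variable names are hypothetical, and the authors' actual merging procedure is not described in the abstract.

```python
# Hypothetical article ids selected by each base classifier.
svm_hits = {"pmid_1", "pmid_2", "pmid_3"}
lr_hits = {"pmid_2", "pmid_3", "pmid_4"}
rf_hits = {"pmid_3", "pmid_5"}

# Set union keeps each article once, however many classifiers selected it.
merged = svm_hits | lr_hits | rf_hits

# Total selections before merging, and how many of them were repeats.
total_selections = len(svm_hits) + len(lr_hits) + len(rf_hits)
repeated_selections = total_selections - len(merged)

print(sorted(merged))
print(total_selections, repeated_selections)
```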

Key words: SARS-CoV-2, data collection, origins of SARS-CoV-2, document screening, entity annotation
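
The abstract does not give the formula behind its multi-κ consistency index. As an illustration only, the sketch below computes Fleiss' kappa, one widely used agreement statistic for more than two raters. Treating it as a stand-in for the authors' index is an assumption. Each row is one annotated item, each column one candidate entity type, and each cell the number of annotators who chose that type. The toy counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # Observed agreement: share of rater pairs that agree, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of ratings in each category.
    n_categories = len(counts[0])
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(s * s for s in shares)

    if expected == 1.0:
        return 1.0  # all ratings fall in one category, so agreement is complete
    return (observed - expected) / (1 - expected)


# Hypothetical example: six annotators, four items, three entity types.
toy_counts = [[6, 0, 0], [0, 6, 0], [4, 2, 0], [0, 0, 6]]
kappa = fleiss_kappa(toy_counts)
print(round(kappa, 3))
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement no better than chance, and negative values indicate agreement below chance.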

CLC Number:

  • G255.51

Cite this article

XU Shuo, ZHANG Mengmeng, LIU Liyuan, WANG Congcong, SUN Rui, LI Yilin, XU Jinnan, AN Xin. Selection of Papers on the Origins of COVID-19 and Entity Annotation Based on Full Texts[J]. Journal of Library and Information Science in Agriculture, 2023, 35(1): 87-98.