Selection of Papers on the Origins of COVID-19 and Entity Annotation Based on Full Texts

doi:10.13998/j.cnki.issn1002-1248.22-0662

Abstract

[Purpose/Significance] Since the outbreak of COVID-19, the number of COVID-19 studies published in China and abroad has grown rapidly. This literature is a data resource for research on the emergence and transmission mechanism of SARS-CoV-2. However, existing COVID-19 datasets are undifferentiated collections of literature that are not classified by sub-field, and coarse-grained metadata such as titles and authors cannot support an in-depth understanding of research progress. This paper therefore builds a dataset for one COVID-19 sub-field and a fine-grained entity dataset.
[Method/Process] First, this paper proposes a literature screening method based on active learning, which obtains more informative labeled samples at lower labor cost so that the classifier generalizes better. Three base classifiers are considered, Support Vector Machine (SVM), Logistic Regression (LR), and Random Forest (RF), together with four query strategies: uncertainty sampling, expected error reduction, query by committee, and random sampling. Taking the origin of SARS-CoV-2, one sub-field of SARS-CoV-2 research, as an example, the method locates origin-related articles in the literature efficiently and accurately. Second, this paper designs an annotation scheme covering 18 entity types, including entities common across the biological domain, such as genes, proteins, and compounds, as well as entities specific to SARS-CoV-2 research, such as coronaviruses and wild animals. Entities were annotated with the visual annotation tool BRAT. The annotation team consisted of one administrator and six annotators, and annotation was carried out in two rounds. A multi-κ agreement index was used to score the consistency of the annotation results.
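As a minimal sketch of the uncertainty sampling query strategy, the example below ranks unlabeled papers by a least-confidence score and returns the ones a human should label next. The function names and the probability values are hypothetical and are not taken from the paper; any of the three base classifiers could supply the class probabilities.

```python
from typing import List, Sequence


def least_confident_scores(probabilities: Sequence[Sequence[float]]) -> List[float]:
    """Uncertainty of each sample, defined as 1 minus its highest class probability."""
    return [1.0 - max(row) for row in probabilities]


def query_most_uncertain(probabilities: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return the indices of the batch_size most uncertain samples in the pool."""
    scores = least_confident_scores(probabilities)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:batch_size]


# Hypothetical classifier output [P(irrelevant), P(relevant)] for five unlabeled papers.
pool_probabilities = [
    [0.95, 0.05],
    [0.52, 0.48],
    [0.10, 0.90],
    [0.60, 0.40],
    [0.30, 0.70],
]

# Papers whose prediction is closest to a coin flip are sent to the annotator first.
to_label = query_most_uncertain(pool_probabilities, batch_size=2)
print(to_label)
```

In a full active learning loop, the selected papers would be labeled, moved into the training set, and the classifier retrained before the next query.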
[Results/Conclusions] Among the query strategies, uncertainty sampling performs best. With uncertainty sampling, SVM, LR, and RF correctly screen 425, 465, and 489 articles, respectively. After overlapping articles are removed, the resulting dataset on the origin of SARS-CoV-2 contains 885 articles. Using the proposed annotation scheme, the six annotators annotated 99 papers. From these fine-grained annotations, this paper builds an entity dataset of 39,118 entities, which is the largest and most comprehensive entity corpus in the field of COVID-19.
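The annotation consistency score can be illustrated with Fleiss' kappa, a widely used agreement statistic for more than two raters. This is a sketch under assumptions: the paper's own index may be computed differently, and the function name, the three annotators, and the count matrix below are hypothetical.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa, where counts[i][j] is the number of annotators
    who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two annotators are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must be rated by the same number of annotators")
    n_categories = len(counts[0])

    # Share of all assignments that fall into each category.
    p_category = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement: fraction of annotator pairs that chose the same category.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    if p_expected == 1.0:
        return 1.0  # all annotators always used the same single category
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: 3 annotators, 4 text spans, 3 entity types.
span_counts = [
    [3, 0, 0],
    [0, 3, 0],
    [2, 1, 0],
    [0, 0, 3],
]
kappa = fleiss_kappa(span_counts)
print(round(kappa, 3))
```

A value of 1 means perfect agreement, and a value near 0 means agreement no better than chance.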

Key words: SARS-CoV-2, data collection, origins of SARS-CoV-2, document screening, entity annotation

CLC Number: G255.51

XU Shuo, ZHANG Mengmeng, LIU Liyuan, WANG Congcong, SUN Rui, LI Yilin, XU Jinnan, AN Xin. Selection of Papers on the Origins of COVID-19 and Entity Annotation Based on Full Texts[J]. Journal of library and information science in agriculture, 2023, 35(1): 87-98.

