Journal of Library and Information Science in Agriculture ›› 2024, Vol. 36 ›› Issue (9): 4-17. doi: 10.13998/j.cnki.issn1002-1248.24-0755

Construction of a Scientific Literature AI Data System for the Thematic Scenario: Technical Framework Research and Practice

Zhijun CHANG1,2,3, Li QIAN1,2,3, Yaoting WU1,2, Yunpeng QU1,2, Yue GONG1,2, Zhixiong ZHANG1,2,3

  1. Documentation and Information Center, National Science Library, Chinese Academy of Sciences, Beijing 100190
  2. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
  3. Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, Beijing 100190
  • Received:2024-07-26 Online:2024-09-05 Published:2025-01-13
  • Contact: Li QIAN

Abstract:

[Purpose/Significance] Artificial intelligence is empowering scientific research and has become a major driver of scientific discovery. High-quality data resources for thematic scenarios are the key to training high-performance AI models. Given the complexity of scientific and technological (S&T) literature data and the limitations of using it directly for large-scale model training, there is an urgent need for a systematic data construction framework that processes, refines, and curates S&T literature resources and ultimately yields a high-quality training corpus for AI applications. A number of related studies exist, but research on S&T literature AI data systems for thematic scenarios is still lacking.

[Method/Process] This article proposes a "3+5 technical framework" for constructing an AI data system for thematic scenarios. Covering the whole construction process, it defines three levels of data content and five stages of data governance. The three data levels are the multi-type basic database, the multi-modal deconstruction database, and the fine-grained semantic mining knowledge base. The five construction stages are multi-channel data source scanning, multi-type basic data construction, multi-modal deconstruction data construction, fine-grained semantic mining knowledge construction, and multi-scenario data application. Taking big data technology and intelligent mining technology as the key elements of data governance, the article describes the system architecture and functions of the data governance tool chain in detail. The core components of the tool chain are a multi-source data aggregation tool, a multi-format data parsing tool, a data cleaning tool, an associated file identification and acquisition tool, a data fusion tool, a multi-modal deconstruction and reorganization tool, and a fine-grained knowledge identification tool. Working together, these tools ensure the efficiency and integrity of the process from raw data to the AI data system.

[Results/Conclusions] To verify the effectiveness of the proposed technical framework, this study built a knowledge base in the field of rice breeding. The AI data system for the thematic scenario of intelligent rice breeding includes a multi-type basic knowledge layer, a multi-modal deconstruction and recombination knowledge layer, and a fine-grained semantic mining knowledge layer. The basic knowledge layer includes general scientific papers and patent data; the multi-modal knowledge layer includes the multi-modal deconstruction of paper content; the domain semantic mining knowledge layer focuses on professional knowledge in intelligent rice breeding, such as rice variety validation data, phenotypic characteristics data, and the rice lineage network. The results show that the framework can effectively process S&T literature data and build a high-quality domain knowledge base, providing data support for applying AI models in rice breeding research and confirming the framework's effectiveness and practicality.

Key words: AI data system, multi-modal deconstruction, semantically annotated data, data governance tool chain, data feature quantization

CLC Number: G250.7

Fig.1 Data structure of the scientific literature AI data system for thematic scenarios

Fig.2 Technical structure of the scientific literature AI data system for thematic scenarios

Fig.3 Design of the tool chain of the scientific literature AI data system for thematic scenarios

Fig.4 Sample of deconstruction data of multi-modal knowledge objects in articles

Fig.5 Sample of deconstruction data of multi-modal knowledge objects in patents

Fig.6 Data sample of rice pedigree relationships
