Journal of Library and Information Science in Agriculture
Construction of a Scientific Literature AI Data System for the Thematic Scenario: Technical Framework Research and Practice
Received date: 2024-07-26
Published online: 2025-01-13
[Purpose/Significance] Artificial intelligence is empowering scientific research and has become a major driver of scientific discovery. High-quality data resources for thematic scenarios are the key to training high-performance AI models. Given the complexity of scientific and technological (S&T) literature data and the limitations of using it directly for large-scale model training, there is an urgent need for a systematic data construction framework that processes, refines, and curates S&T literature resources into a high-quality training corpus for AI applications. A number of related studies have been conducted, but research on S&T literature AI data systems for thematic scenarios is still lacking.

[Method/Process] This article proposes a "3+5" technical framework for constructing an AI data system for thematic scenarios. Covering the whole process of AI data system construction, the framework defines three levels of data content and five stages of data governance. The three levels of data content are the multi-type basic database, the multi-modal deconstruction database, and the fine-grained semantic mining knowledge base. The five construction stages are multi-channel data source scanning, multi-type basic data construction, multi-modal deconstruction data construction, fine-grained semantic mining knowledge construction, and multi-scenario data application. Taking big data technology and intelligent mining technology as the key elements of data governance, the article describes the system architecture and functions of the data governance tool chain in detail. The core components of the tool chain are a multi-source data aggregation tool, a multi-format data parsing tool, a data cleaning tool, an associated file identification and acquisition tool, a data fusion tool, a multi-modal deconstruction and reorganization tool, and a fine-grained knowledge identification tool. Working together, these tools ensure the efficiency and integrity of the process that turns raw data into the AI data system.

[Results/Conclusions] To verify the effectiveness of the proposed technical framework, this study built a knowledge base in the field of rice breeding. The AI data system for the thematic scenario of intelligent rice breeding comprises a multi-type basic knowledge layer, a multi-modal deconstruction and recombination knowledge layer, and a fine-grained semantic mining knowledge layer. The basic knowledge layer contains general scientific papers and patent data; the multi-modal knowledge layer contains the multi-modal deconstruction of paper content; the semantic mining knowledge layer focuses on professional knowledge in intelligent rice breeding, such as rice variety validation data, phenotypic characteristics data, and the rice lineage network. The results show that the framework can effectively process S&T literature data and build a high-quality domain knowledge base, providing data support for applying AI models in rice breeding research and confirming the framework's effectiveness and practicality.
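The "3+5" structure described in the abstract can be sketched as a small data model. This is a minimal illustration only, not the authors' implementation: the names `Layer`, `Stage`, `STAGES`, and `layers_built_through` are hypothetical, and the pairing of each construction stage with the data layer it produces is inferred from the matching names rather than stated in the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Layer(Enum):
    """The three levels of data content named in the abstract."""
    BASIC = "multi-type basic database"
    MULTIMODAL = "multi-modal deconstruction database"
    SEMANTIC = "fine-grained semantic mining knowledge base"


@dataclass(frozen=True)
class Stage:
    """One of the five data governance stages."""
    order: int
    name: str
    produces: Optional[Layer]  # None when the stage builds no data layer


# Stage names follow the abstract. The stage-to-layer pairing is an
# assumption of this sketch, inferred from the matching names.
STAGES = (
    Stage(1, "multi-channel data source scanning", None),
    Stage(2, "multi-type basic data construction", Layer.BASIC),
    Stage(3, "multi-modal deconstruction data construction", Layer.MULTIMODAL),
    Stage(4, "fine-grained semantic mining knowledge construction", Layer.SEMANTIC),
    Stage(5, "multi-scenario data application", None),
)


def layers_built_through(stage_order: int) -> List[Layer]:
    """Return the layers available once stages 1..stage_order have run."""
    return [
        stage.produces
        for stage in STAGES
        if stage.order <= stage_order and stage.produces is not None
    ]


if __name__ == "__main__":
    for stage in STAGES:
        built = [layer.name for layer in layers_built_through(stage.order)]
        print(stage.order, stage.name, "->", built)
```

Modelling the stages as ordered records keeps the dependency explicit: under this reading, each data layer only exists once the stage that constructs it has run, and the final application stage consumes all three layers.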
Zhijun CHANG, Li QIAN, Yaoting WU, Yunpeng QU, Yue GONG, Zhixiong ZHANG. Construction of a Scientific Literature AI Data System for the Thematic Scenario: Technical Framework Research and Practice[J]. Journal of Library and Information Science in Agriculture, 2024, 36(9): 4-17. DOI: 10.13998/j.cnki.issn1002-1248.24-0755