Journal of Library and Information Science in Agriculture ›› 2024, Vol. 36 ›› Issue (9): 4-17. doi: 10.13998/j.cnki.issn1002-1248.24-0755

• Special Topic: AI Data System Construction •

Construction of a Scientific Literature AI Data System for the Thematic Scenario: Technical Framework Research and Practice

Zhijun CHANG1,2,3, Li QIAN1,2,3, Yaoting WU1,2, Yunpeng QU1,2, Yue GONG1,2, Zhixiong ZHANG1,2,3

  1. Documentation and Information Center, National Science Library, Chinese Academy of Sciences, Beijing 100190
  2. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
  3. Key Laboratory of New Publishing and Knowledge Services for Scholarly Journals, National Press and Publication Administration, Beijing 100190
  • Received: 2024-07-26 Online: 2024-09-05 Published: 2025-01-13
  • Corresponding author: Li QIAN
  • About the authors:
    Zhijun CHANG (1981- ), male, master's degree, associate research librarian, master's supervisor. Research interests: big data platform construction, smart data governance, and data mining.
    Yaoting WU (1999- ), female, master's student. Research interests: smart data and smart libraries.
    Yunpeng QU (1980- ), male, PhD, senior engineer. Research interests: data mining and smart data governance.
    Yue GONG (1987- ), female, master's degree, associate research librarian. Research interests: subject intelligence analysis in the life and health field.
    Zhixiong ZHANG, male, PhD, doctoral supervisor. Research interests: applications of big data and artificial intelligence technology.
  • Funding:
    National Social Science Fund of China project "Research on the Theoretical System and Construction Methods of the AI4S Scientific Literature Knowledge Foundation" (24BTQ043); National Social Science Fund of China project "Research on Entity Relation Recognition Methods for Domain Literature Oriented to Evidence-Based Medicine" (21BTQ106)

Abstract:

[Purpose/Significance] Artificial intelligence is empowering scientific research and has become a major driver of scientific discovery. High-quality data resources for thematic scenarios are key to training high-performance AI models. Given the complexity of scientific and technological (S&T) literature data and the limitations of using it directly for large model training, there is an urgent need for a systematic technical framework for data construction that processes, refines, and integrates S&T literature resources into a high-quality training corpus for AI applications. A number of related studies exist, but research on S&T literature AI data systems for thematic scenarios is still lacking.

[Method/Process] This article proposes a "3+5 technical framework" for building an S&T literature AI data system for thematic scenarios. Covering the whole process of AI data system construction, it defines three levels of data content and five stages of data governance. The three data levels are the multi-type basic database, the multi-modal deconstruction database, and the fine-grained semantic mining knowledge base. The five construction stages are multi-channel data source scanning, multi-type basic data construction, multi-modal deconstruction data construction, fine-grained semantic mining knowledge construction, and multi-scenario data application. With big data technology and intelligent mining technology as the key elements of data governance, the article describes in detail the system architecture and functions of the data governance tool chain. The core components of the tool chain are a multi-source data aggregation tool, a multi-format data parsing tool, a data cleaning tool, an associated file identification and acquisition tool, a data fusion tool, a multi-modal deconstruction and reorganization tool, and a fine-grained knowledge identification tool. Working together, these tools ensure the efficiency and integrity of the process from raw data to the AI data system.

[Results/Conclusions] To verify the effectiveness of the proposed technical framework, this study applied it to build an AI data system for the rice breeding field. The AI data system for the thematic scenario of intelligent rice breeding comprises a multi-type basic knowledge layer, a multi-modal deconstruction and recombination knowledge layer, and a fine-grained semantic mining knowledge layer. The basic knowledge layer includes general scientific papers and patent data. The multi-modal knowledge layer includes the multi-modal deconstruction of paper content. The domain semantic mining knowledge layer focuses on professional knowledge in intelligent rice breeding, such as rice variety validation data, phenotypic characteristics data, and the rice lineage network. The results show that the framework can effectively process S&T literature data and build a high-quality domain dataset, providing data support for the application of AI models in rice breeding research and confirming the effectiveness and practicality of the framework.

Key words: AI data system, multi-modal deconstruction, semantically annotated data, data governance tool chain, data feature vectorization
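
As a reading aid, the "3+5" structure described in the abstract can be sketched as a small staged pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer and stage names come from the abstract, while `Record`, `run_pipeline`, the stage-to-layer mapping, and the three toy handlers are hypothetical stand-ins (paragraph splitting substitutes for real multi-modal deconstruction, and keyword matching for real semantic mining).

```python
from dataclasses import dataclass, field
from enum import Enum


class DataLayer(Enum):
    """The three levels of data content named in the abstract."""
    BASIC = "multi-type basic database"
    MULTIMODAL = "multi-modal deconstruction database"
    SEMANTIC = "fine-grained semantic mining knowledge base"


# The five governance stages, in the order the abstract lists them.
STAGES = (
    "multi-channel data source scanning",
    "multi-type basic data construction",
    "multi-modal deconstruction data construction",
    "fine-grained semantic mining knowledge construction",
    "multi-scenario data application",
)

# Assumption: each construction stage feeds the layer with the matching name.
STAGE_OUTPUT = {
    STAGES[1]: DataLayer.BASIC,
    STAGES[2]: DataLayer.MULTIMODAL,
    STAGES[3]: DataLayer.SEMANTIC,
}


@dataclass
class Record:
    """Hypothetical container for one document moving through the stages."""
    doc_id: str
    raw_text: str
    layers: dict = field(default_factory=dict)    # DataLayer -> derived content
    completed: list = field(default_factory=list)  # stage names, in run order


def run_pipeline(record, handlers):
    """Run every stage in order; store a handler's result in its mapped layer."""
    for stage in STAGES:
        handler = handlers.get(stage)
        result = handler(record) if handler else None
        layer = STAGE_OUTPUT.get(stage)
        if layer is not None and result is not None:
            record.layers[layer] = result
        record.completed.append(stage)
    return record


# Toy handlers, purely illustrative.
def build_basic(record):
    # Treat the first line of the text as the title.
    title = record.raw_text.strip().splitlines()[0]
    return {"id": record.doc_id, "title": title}


def deconstruct(record):
    # Stand-in for multi-modal deconstruction: split into paragraphs.
    return [p.strip() for p in record.raw_text.split("\n\n") if p.strip()]


def mine_semantics(record):
    # Stand-in for fine-grained mining: match a tiny hypothetical vocabulary.
    vocabulary = ("variety", "phenotype", "lineage")
    text = record.raw_text.lower()
    return [term for term in vocabulary if term in text]


demo_handlers = {
    STAGES[1]: build_basic,
    STAGES[2]: deconstruct,
    STAGES[3]: mine_semantics,
}
demo = run_pipeline(
    Record("demo-1", "A rice variety report\n\nPhenotype: plant height."),
    demo_handlers,
)
```

Keeping the stages as an ordered tuple and the layers as an enum makes the "three levels, five stages" split explicit; in a real system each handler would presumably be replaced by one or more tools from the data governance tool chain.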

CLC number: G250.7

Cite this article

Zhijun CHANG, Li QIAN, Yaoting WU, Yunpeng QU, Yue GONG, Zhixiong ZHANG. Construction of a Scientific Literature AI Data System for the Thematic Scenario: Technical Framework Research and Practice[J]. Journal of Library and Information Science in Agriculture, 2024, 36(9): 4-17.