Journal of Library and Information Science in Agriculture


Multimodal Knowledge Extraction Toolchain for Scientific Literature towards AI4S

GE Lan1, HUANG Yongwen1, KONG Lingbo1, SUN Tan2,3, ZHAO Ruixue1,4, LUO Tingting1, XIAN Guojian1,2

  1. Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081
  2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081
  3. Chinese Academy of Agricultural Sciences, Beijing 100081
  4. Key Laboratory of Knowledge Mining and Knowledge Services in Agricultural Converging Publishing, National Press and Publication Administration, Beijing 100081
  • Received: 2026-04-03 Published online: 2026-06-25
  • Corresponding author: XIAN Guojian E-mail: xianguojian@caas.cn
  • About the authors: GE Lan, master's student; research interests: knowledge organization and knowledge services
    HUANG Yongwen, PhD, research fellow; research interests: knowledge organization and knowledge services
    KONG Lingbo, PhD candidate, associate research librarian; research interests: knowledge services and intelligence analysis
    SUN Tan, PhD, research librarian (Grade 2); research interests: digital information description and organization
    ZHAO Ruixue, PhD, research fellow; research interests: agricultural information management systems
    LUO Tingting, master's degree, associate research fellow; research interests: big data integration and governance
  • Funding:
    National Social Science Fund of China, General Project "Research on Semantic Organization and Association Discovery Services for Multimodal Scientific and Technological Resources" (22BTQ079); 2026 Science and Technology Innovation Program task "Innovative Leading Talents" of the Agricultural Information Institute, Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2026-AII)

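Summary:

[Purpose/Significance] Artificial intelligence-driven scientific discovery (AI4S) and large language models create an urgent demand for high-quality multimodal corpora, while traditional knowledge organization based on the external features of documents cannot support deep knowledge services. Multimodal, multi-granularity knowledge extraction from the full text of scientific literature is therefore needed to mine structured knowledge units systematically. [Method/Process] The study systematically surveys existing tools, builds a knowledge representation model for literature, and designs and implements a pipeline-style extraction process covering document acquisition, structure parsing, multimodal content extraction, and storage. Three problems (insufficient precision of basic information, disordered structure of academic statements, and missing information in supporting materials) are each addressed with an optimization that is integrated into the toolchain. [Results/Conclusions] An empirical study in the rice breeding domain and validation on the SciWatch platform show that the toolchain effectively converts unstructured PDF documents into a structured, linked knowledge base. The optimized extraction strategies and models for basic information, academic statements, and supporting materials significantly improve the precision and recall of knowledge unit extraction and can support large-scale knowledge extraction from scientific literature. The results offer a scalable solution and practical reference for domain knowledge mining and discovery and for the evolution of large language model technology.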
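The pipeline-style process named above (document acquisition, structure parsing, multimodal content extraction, storage) can be pictured as a chain of stages that each take and return a shared document object. The following is only a minimal sketch under stated assumptions: the `Document` class, the stage functions, the `LABEL|text` toy format, and the in-memory list used as a knowledge base are all hypothetical and are not taken from the authors' toolchain.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical container handed from one stage to the next."""
    source: str
    raw_text: str = ""
    sections: dict = field(default_factory=dict)  # label -> list of text spans
    units: list = field(default_factory=list)     # extracted knowledge units


def acquire(doc):
    # Stand-in for fetching and reading a PDF; the text is hard-coded here.
    doc.raw_text = (
        "TITLE|A demo paper\n"
        "BODY|Yield rose, as shown in Figure 1.\n"
        "CAPTION|Figure 1. Yield by year."
    )
    return doc


def parse_structure(doc):
    # Split every line of the toy format into a label and its text.
    for line in doc.raw_text.splitlines():
        label, _, text = line.partition("|")
        doc.sections.setdefault(label, []).append(text)
    return doc


def extract_units(doc):
    # Turn each labelled span into a typed knowledge unit.
    for label, texts in doc.sections.items():
        for text in texts:
            doc.units.append({"source": doc.source, "type": label.lower(), "text": text})
    return doc


def make_store(knowledge_base):
    # Build a storage stage that appends units to the given list.
    def store(doc):
        knowledge_base.extend(doc.units)
        return doc
    return store


def run_pipeline(doc, stages):
    # Apply the stages in order; each stage receives the previous stage's output.
    for stage in stages:
        doc = stage(doc)
    return doc


knowledge_base = []
result = run_pipeline(
    Document(source="demo.pdf"),
    [acquire, parse_structure, extract_units, make_store(knowledge_base)],
)
```

Because every stage shares one signature, a stage can be swapped or re-tuned without touching the others, which is one plausible reason to prefer a cascading design for this kind of task.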

Keywords: knowledge extraction, multimodal knowledge, AI4S, scientific literature, knowledge unit, large language model

Abstract:

[Purpose/Significance] The deep integration of the latest technological revolution and industrial transformation has created an urgent demand for high-quality multimodal corpora for artificial intelligence-driven scientific discovery (AI4S) and large language models. Traditional coarse-grained knowledge organization at the document level is no longer sufficient for deep knowledge services. This study constructs a toolchain for extracting multimodal, multi-granularity knowledge units from scientific and technological literature, enabling the systematic mining of structured knowledge units from large volumes of literature and improving the depth and efficiency of knowledge services.
[Method/Process] This study systematically reviewed mainstream knowledge extraction tools, both domestic and international, and compared and screened them in terms of technical principles, functional characteristics, application advantages, limitations, and processing efficiency. An application requirement framework was constructed at four levels: identification of research subjects, context tracing, content analysis, and evidence localization. Taking rice breeding as the empirical scenario, a knowledge representation model for multimodal information was built on the physical organizational logic of documents. Each document was divided into 22 knowledge units in four major categories: basic information subjects, structural support, material systems, and academic descriptions. The knowledge units have clear boundaries and rich associations with one another. By combining the extraction requirements of each type of knowledge unit with the results of the tool survey, a pipeline-style extraction framework for multimodal, multi-granularity knowledge units was designed. The framework covers the whole process of document acquisition, physical structure analysis, logical structure reconstruction, multimodal content extraction, and knowledge unit fusion and storage, forming a cascading toolchain from original PDF documents to semi-structured data and then to structured knowledge. To address three major issues (insufficient accuracy of basic information, chaotic structure of academic statements, and missing information in supporting materials), three optimizations were developed and integrated into the full toolchain: domain-adaptive retraining of GROBID, fused parsing of XML and Markdown, and hierarchical extraction instructions for the DeepSeek large language model.
[Results/Conclusions] Preliminary experiments showed that the toolchain extracts multimodal, multi-granularity data well. In the optimization experiments, the overall micro-average F1 score of the header model increased by nearly 3 percentage points, significantly improving the model's balance and generalization across documents in diverse formats. The scattered distribution and weak structure of academic statement information were resolved, yielding robust structured extraction of more than ten types of statements, such as acknowledgements, conflicts of interest, and data availability. Introducing the large language model DeepSeek enabled deep mining and association of figure and table titles, formal citation sentences, and related discussion sentences. The model achieved an F1 score above 0.99 for extracting figure and table titles and above 0.93 for recognizing formal citation sentences. Validation on the SciWatch platform demonstrated the extraction, presentation, knowledge association, and contextual coherence of figures and tables, supporting deep literature understanding and cross-validation. The toolchain constructed in this paper efficiently and accurately automates the extraction and structured use of the various knowledge units in scientific literature, and covers preprocessing, multimodal information recognition, relation extraction, knowledge fusion, and storage. The results provide a scalable solution and practical reference for the evolution of domain knowledge mining and knowledge service technology.
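For readers less familiar with the micro-average F1 score reported for the header model, the sketch below shows how such a score is commonly computed: true positives, false positives, and false negatives are pooled across all fields before precision and recall are taken. The field names and counts are invented for illustration and are not results from this study; the function name `micro_f1` is likewise an assumption.

```python
def micro_f1(counts):
    """Micro-averaged F1.

    `counts` maps a field name to a tuple of
    (true_positives, false_positives, false_negatives).
    """
    # Pool the counts over all fields first (this is what makes it "micro").
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for two header fields, for illustration only.
example_counts = {"title": (8, 2, 0), "author": (2, 0, 2)}
score = micro_f1(example_counts)
```

Because counts are pooled, fields with many instances weigh more heavily than rare ones, so a gain in micro-average F1 reflects improvement over the bulk of extracted items rather than an average over field types.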

Key words: knowledge extraction, multimodal knowledge, AI4S, scientific literature, knowledge unit, large language model

CLC number: G254.9

Cite this article

GE Lan, HUANG Yongwen, KONG Lingbo, SUN Tan, ZHAO Ruixue, LUO Tingting, XIAN Guojian. Multimodal Knowledge Extraction Toolchain for Scientific Literature towards AI4S[J/OL]. Journal of Library and Information Science in Agriculture. https://doi.org/10.13998/j.cnki.issn1002-1248.26-0178.