Journal of Library and Information Science in Agriculture ›› 2025, Vol. 37 ›› Issue (2): 4-22. doi: 10.13998/j.cnki.issn1002-1248.25-0116

• Invited Review •

Analysis of Progress in Data Mining of Scientific Literature Using Large Language Models

CAI Yiran1,2, HU Zhengyin1,2(), LIU Chunjiang1,2

  1. National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu 610299
    2. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
  • Received: 2025-01-06 Online: 2025-02-05 Published: 2025-05-20
  • Corresponding author: HU Zhengyin
  • About the authors:

    CAI Yiran (2001- ), female, master's student; research interests: scientific literature data mining, disciplinary knowledge discovery

    LIU Chunjiang (1984- ), male, PhD, senior engineer; research interest: scientific literature data mining

  • Funding:
    Key Project of the Major Research Plan of the National Natural Science Foundation of China, "General-Purpose High-Quality Scientific Database Supporting Next-Generation Artificial Intelligence" (92470204)

Abstract:

[Purpose/Significance] Scientific literature contains rich domain knowledge and scientific data, which can provide high-quality data support for AI-driven scientific research (AI for Science, AI4S). This paper systematically reviews the methods and techniques, software tools, and application scenarios of large language models (LLMs) in scientific literature data mining, and discusses research directions and development trends. [Method/Process] Based on a literature survey and synthesis, at the level of methods and techniques, this paper analyzes the key techniques of LLM-driven fine-grained data mining of scientific literature from the perspectives of textual knowledge, scientific data, and chart information, together with methods for comprehensive knowledge generation; at the level of software tools, it summarizes the methods, core functions, and applicable scenarios of mainstream LLM tools for scientific literature data mining and knowledge generation; at the level of application scenarios, it analyzes the practical value of applying scientific literature data mining to LLMs. [Results/Conclusions] In terms of methods and techniques, through dynamic prompt learning frameworks, domain-adapted fine-tuning, and related techniques, LLMs greatly improve the precision and validity of scientific literature data mining. In terms of software tools, a full-pipeline LLM tool chain for scientific literature data mining has taken initial shape, spanning data annotation, data mining, synthetic data, and knowledge generation. In terms of applications, scientific literature data can supply LLMs with specialized corpora and high-quality data, while LLMs drive the paradigm evolution of scientific literature from single-dimensional data services toward multimodal knowledge generation services. However, challenges remain, including insufficient depth of domain knowledge representation, low efficiency of cross-modal reasoning, and a lack of interpretability in knowledge generation. Future work should focus on developing LLM tools for scientific literature data mining that are interpretable and adaptable across domains, and on integrating "human-in-the-loop" collaboration mechanisms, so as to move scientific literature data mining from efficiency optimization toward knowledge creation.

Keywords: scientific literature data mining, large language models, AI4S, data-driven, knowledge discovery

Abstract:

[Purpose/Significance] Scientific literature contains rich domain knowledge and scientific data, which can provide high-quality data support for AI-driven scientific research (AI4S). This paper systematically reviews the methods, tools, and applications of large language models (LLMs) in scientific literature data mining, and discusses their research directions and development trends. It addresses critical shortcomings in interdisciplinary knowledge extraction and provides practical insights to enhance AI4S workflows, thereby aligning AI capabilities with domain-specific scientific needs. [Method/Process] This study employs a systematic literature review and case analysis to formulate a tripartite framework: 1) Methodological dimension: Textual knowledge mining uses dynamic prompts, few-shot learning, and domain-adaptive pre-training (such as MagBERT and MatSciBERT) to improve entity recognition; scientific data extraction uses chain-of-thought prompting and knowledge graphs (such as ChatExtract and SynAsk) to parse experimental datasets; chart decoding uses neural networks to extract numerical values and semantic patterns from visual elements. 2) Tool dimension: This dimension examines the core functionalities of notable AI tools, including data mining platforms (such as LitU and SciAIEngine) and knowledge generation systems (such as Agent Laboratory and VirSci), with a focus on multimodal processing and automation. 3) Application dimension: LLMs produce high-quality datasets to tackle the issue of data scarcity, facilitating tasks such as predicting material properties and diagnosing medical conditions; the scientific credibility of these datasets is ensured through a process of "LLMs + expert validation". [Results/Conclusions] The findings indicate that LLMs significantly improve the automation of scientific literature mining. Methodologically, this research introduces dynamic prompt learning frameworks and domain adaptation fine-tuning technologies to address the shortcomings of traditional rule-driven approaches. In terms of tools, cross-modal parsing tools and interactive analysis platforms have been developed to facilitate end-to-end data mining and knowledge generation. In terms of applications, the study has accelerated the transition of scientific literature from single-modal to multimodal formats, thereby supporting the creation of high-quality scientific datasets, vertical domain-specific models, and knowledge service platforms. However, significant challenges remain, including insufficient depth of domain knowledge embedding, low efficiency of multimodal data collaboration, and a lack of model interpretability. Future research should focus on developing interpretable LLMs with knowledge graph integration, improving cross-modal alignment techniques, and integrating "human-in-the-loop" systems to enhance reliability. It is also imperative to establish standardized data governance and intellectual property frameworks to promote the ethical utilization of scientific literature data. Such advances will facilitate a shift from efficiency optimization to knowledge generation in AI4S.
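As a concrete illustration of the prompting techniques the abstract surveys (few-shot examples combined with chain-of-thought instructions for extracting experimental values from text, in the spirit of tools like ChatExtract), the sketch below builds such a prompt and parses a structured model reply. The `build_prompt` and `parse_reply` helpers, the prompt wording, the JSON reply format, and the stubbed reply are illustrative assumptions, not the actual implementation of any system cited in the abstract.

```python
import json

FEW_SHOT = [
    # One worked example showing the model the expected reasoning and output.
    {
        "passage": "The annealed films exhibited a band gap of 1.52 eV.",
        "reasoning": "The sentence reports a measured property (band gap) "
                     "with a numeric value and unit for the annealed films.",
        "extraction": {"material": "annealed films",
                       "property": "band gap",
                       "value": 1.52, "unit": "eV"},
    },
]

def build_prompt(passage: str) -> str:
    """Compose a few-shot, chain-of-thought extraction prompt."""
    parts = ["Extract material property measurements from the passage.",
             "First explain your reasoning, then output one JSON object "
             "with keys: material, property, value, unit.", ""]
    for shot in FEW_SHOT:
        parts += [f"Passage: {shot['passage']}",
                  f"Reasoning: {shot['reasoning']}",
                  f"JSON: {json.dumps(shot['extraction'])}", ""]
    parts += [f"Passage: {passage}", "Reasoning:"]
    return "\n".join(parts)

def parse_reply(reply: str) -> dict:
    """Pull the JSON record out of a 'reasoning ... JSON: {...}' reply."""
    json_part = reply.split("JSON:", 1)[1]
    return json.loads(json_part)

# A stubbed model reply stands in for a real LLM call in this sketch.
reply = ("The passage gives the tensile strength of the alloy. "
         'JSON: {"material": "Mg alloy", "property": "tensile strength", '
         '"value": 285, "unit": "MPa"}')
record = parse_reply(reply)
print(record["property"], record["value"], record["unit"])  # tensile strength 285 MPa
```

Forcing the model to state its reasoning before the JSON record is the chain-of-thought step; the worked example fixes the output schema, which is what makes the reply machine-parseable downstream.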

Key words: scientific literature data mining, large language models, AI for Science, data-driven, knowledge discovery

CLC number: G350, G203

Cite this article

CAI Yiran, HU Zhengyin, LIU Chunjiang. Analysis of Progress in Data Mining of Scientific Literature Using Large Language Models[J]. Journal of Library and Information Science in Agriculture, 2025, 37(2): 4-22.