
Journal of library and information science in agriculture ›› 2025, Vol. 37 ›› Issue (2): 4-22. DOI: 10.13998/j.cnki.issn1002-1248.25-0116


Analysis of Progress in Data Mining of Scientific Literature Using Large Language Models

CAI Yiran1,2, HU Zhengyin1,2, LIU Chunjiang1,2

  1. National Science Library (Chengdu), Chinese Academy of Sciences, Chengdu 610299
    2. Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190
  • Received: 2025-01-06  Online: 2025-02-05  Published: 2025-05-20
  • Contact: HU Zhengyin

Abstract:

[Purpose/Significance] Scientific literature contains rich domain knowledge and scientific data, which can provide high-quality data support for AI-driven scientific research (AI4S). This paper systematically reviews the methods, tools, and applications of large language models (LLMs) in scientific literature data mining, and discusses their research directions and development trends. It addresses critical shortcomings in interdisciplinary knowledge extraction and provides practical insights to enhance AI4S workflows, thereby aligning AI capabilities with domain-specific scientific needs. [Method/Process] This study employs a systematic literature review and case analysis to formulate a tripartite framework: 1) Methodological dimension: Textual knowledge mining uses dynamic prompts, few-shot learning, and domain-adaptive pre-training (such as MagBERT and MatSciBERT) to improve entity recognition. Scientific data extraction uses chain-of-thought prompting and knowledge graphs (such as ChatExtract and SynAsk) to parse experimental datasets. Chart decoding uses neural networks to extract numerical values and semantic patterns from visual elements. 2) Tool dimension: This dimension explores the core functionalities of notable AI tools, including data mining platforms (such as LitAI and SciAIEngine) and knowledge generation systems (such as Agent Laboratory and VirSci), with a focus on multimodal processing and automation. 3) Application dimension: LLMs produce high-quality datasets to tackle the issue of data scarcity. They facilitate tasks such as predicting material properties and diagnosing medical conditions. The scientific credibility of these datasets is ensured through a process of "LLMs + expert validation". [Results/Conclusions] The findings indicate that LLMs significantly improve the automation of scientific literature mining.
Methodologically, this research introduces dynamic prompt learning frameworks and domain adaptation fine-tuning technologies to address the shortcomings of traditional rule-driven approaches. In terms of tools, cross-modal parsing tools and interactive analysis platforms have been developed to facilitate end-to-end data mining and knowledge generation. In terms of applications, the study has accelerated the transition of scientific literature from single-modal to multimodal formats, thereby supporting the creation of high-quality scientific datasets, vertical domain-specific models, and knowledge service platforms. However, significant challenges remain, including insufficient depth of domain knowledge embedding, the low efficiency of multimodal data collaboration, and a lack of model interpretability. Future research should focus on developing interpretable LLMs with knowledge graph integration, improving cross-modal alignment techniques, and integrating "human-in-the-loop" systems to enhance reliability. It is also imperative to establish standardized data governance and intellectual property frameworks to promote the ethical utilization of scientific literature data. Such advances will facilitate a shift from efficiency optimization to knowledge generation in AI4S.
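The few-shot prompting workflow surveyed above can be illustrated with a minimal sketch. The example sentences, labels, and record fields below are invented for illustration; a real ChatExtract-style pipeline would send the assembled prompt to an LLM API and parse its reply.

```python
# Minimal sketch of few-shot prompt construction for extracting
# material-property records from scientific text. The two worked
# examples are hypothetical; a production system would add more and
# pass the prompt to a model endpoint.

FEW_SHOT_EXAMPLES = [
    ("The yield strength of the AZ31 alloy reached 180 MPa after rolling.",
     {"material": "AZ31 alloy", "property": "yield strength", "value": "180 MPa"}),
    ("Annealed Cu films showed an electrical conductivity of 5.8e7 S/m.",
     {"material": "Cu films", "property": "electrical conductivity", "value": "5.8e7 S/m"}),
]

def build_extraction_prompt(sentence: str) -> str:
    """Assemble a few-shot prompt asking the model for a JSON record."""
    lines = [
        "Extract the material, property, and value from each sentence.",
        'Answer with JSON {"material": ..., "property": ..., "value": ...}.',
        "",
    ]
    for text, record in FEW_SHOT_EXAMPLES:
        lines.append(f"Sentence: {text}")
        lines.append(f"Answer: {record}")
        lines.append("")
    # The unlabeled query sentence goes last; the model completes "Answer:".
    lines.append(f"Sentence: {sentence}")
    lines.append("Answer:")
    return "\n".join(lines)

prompt = build_extraction_prompt("The band gap of anatase TiO2 is 3.2 eV.")
```

The same template generalizes to zero-shot prompting by dropping the worked examples, which is the main lever the surveyed methods vary.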

Key words: scientific literature data mining, large language models, AI for Science, data-driven, knowledge discovery

CLC Number: G350

Fig. 1  Analysis framework of the progress in data mining of scientific literature using large language models

Fig. 2  The object of data mining in scientific literature

Fig. 3  Bidirectional empowerment framework of scientific literature synthetic data and large language models

Table 1  Methods of data mining and knowledge discovery in scientific literature using large language models

Category | Function | Methods and techniques
Data mining | Textual knowledge mining | In-context learning[17,27,50], few-shot prompting[19,21,22,26], zero-shot prompting[24,25], chain-of-thought prompting[27], tool invocation and API integration[18,70], GraphRAG[24], fine-tuning[18,26], pre-training[18,20-22], RAG[71]
 | Scientific data mining | Chain-of-thought prompting[29,34], GraphRAG[39], fine-tuning[30,72], active learning[36,39,40,70], automated reasoning and planning[32,34]
 | Chart information mining | In-context learning[50], pre-training[46,47,50], convolutional recurrent neural networks[44,45], deep neural networks[46], attention mechanism[44]
Knowledge generation | Literature review generation | Few-shot prompting[54], RAG[18,53,54,58], fine-tuning[18,71]
 | Synthetic data generation | In-context learning[26,68], few-shot prompting[26,73], GraphRAG[63,64], automated reasoning and planning[32,34], fine-tuning[26,33]
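After the prompting step, extraction pipelines like those in Table 1 must turn a free-text model reply into a validated record. The following sketch assumes a "material | property | value unit" reply format, which is an invented convention for illustration; malformed replies are returned as None so they can be routed to expert review.

```python
import re

# Hedged sketch: parse an LLM reply into a structured record and check
# that the value is numeric. The reply format is an assumption, not a
# real tool's output schema.
VALUE_RE = re.compile(
    r"^\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*([A-Za-z0-9/%^]*)\s*$"
)

def parse_reply(reply: str):
    """Split 'material | property | value unit'; None means review needed."""
    parts = [p.strip() for p in reply.split("|")]
    if len(parts) != 3:
        return None  # malformed reply: route to expert review
    material, prop, raw_value = parts
    m = VALUE_RE.match(raw_value)
    if not m:
        return None  # non-numeric value: route to expert review
    number, unit = m.groups()
    return {"material": material, "property": prop,
            "value": float(number), "unit": unit or None}

rec = parse_reply("anatase TiO2 | band gap | 3.2 eV")
```

Keeping the validation deterministic (regex plus unit check) is what lets low-confidence outputs be separated from directly usable ones.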

Table 2  Typical tools for data mining in scientific literature using large language models

Type | Function | Tool | Methods and techniques | Application scenarios
Data mining | Textual knowledge mining | LitAI[74] | OCR, in-context learning, few-shot prompting, chain-of-thought prompting | Text extraction and structuring, text quality enhancement, text classification, grammar correction, reference management
 | | GOT-OCR2.0[75] | Attention mechanism, in-context learning, multi-stage pre-training, instruction fine-tuning | Text recognition, document digitization
 | | SciAIEngine[78] | Natural language processing, few-shot prompting, prompt engineering | Move recognition, named entity recognition, sci-tech literature mining, deep clustering, etc.
 | | MDocAgent[83] | OCR, RAG, in-context learning, GraphRAG | Multimodal data fusion, text recognition and extraction, document question answering
 | | LongDocURL[84] | Feature fusion, RAG, OCR | Long-document parsing, document question answering
 | Scientific data mining | LitAI[74] | OCR, in-context learning, few-shot prompting, chain-of-thought prompting | Scientific data extraction
 | | MinerU | In-context learning, multimodal fusion, reinforcement learning from human feedback | Multimodal scientific data mining, numerical formula recognition, mining of equations and molecular structural formulas
 | | TableGPT2[85] | Attention mechanism, neural network architectures | Table understanding, data management, data computation and analysis
 | | GOT-OCR2.0[75] | Attention mechanism, in-context learning, multi-stage pre-training, instruction fine-tuning | Numerical formula recognition, mining of equations and molecular structural formulas
 | | olmOCR[76] | OCR, document anchoring, fine-tuning, chain-of-thought prompting | Numerical formula recognition, mining of equations and molecular structural formulas
 | Chart information mining | LitAI[74] | OCR, in-context learning, few-shot prompting, chain-of-thought prompting | Caption extraction and interpretation, linking image data with text data, chart semantic enhancement
 | | olmOCR[76] | OCR, fine-tuning, chain-of-thought prompting | Table recognition, extraction of key data points from charts
Knowledge generation | Literature review generation | Agent Laboratory[80] | Chain-of-thought prompting, Transformer-based architecture | Literature review, experiment design and analysis, code generation, result interpretation, report writing
 | | Web of Science Research Assistant | In-context learning, chain-of-thought prompting | Literature review, journal recommendation, data visualization
 | | SciAIEngine[78] | Natural language processing, few-shot prompting, prompt engineering | Title generation, structured automatic review
 | | Deep Research | Natural language processing, end-to-end reinforcement learning | Literature review, paper polishing, report generation
 | | AutoSurvey[54] | RAG, prompt engineering, word embeddings | Initial retrieval and outline generation, subsection drafting, integration and refinement, evaluation and iteration
 | Knowledge discovery | VirSci[79] | RAG, multi-task learning, model fine-tuning, GraphRAG | Topic discussion, novelty assessment, abstract generation, knowledge base construction, multi-agent collaboration
 | | Spark Research Assistant (星火科研助手)[6] | Pre-training, supervised fine-tuning, reinforcement learning from human feedback | Research surveys, review generation, tracking domain updates, paper reading, multi-document question answering, research direction recommendation
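The RAG-based tools in Table 2 all share a retrieval step: rank candidate passages against a query, then ground the generation prompt in the top hits. A minimal sketch using token overlap follows; the passages are invented, and production systems would use dense embeddings rather than Jaccard similarity.

```python
# Sketch of the retrieval half of a RAG pipeline: score passages by
# token overlap with the query, keep the best k, and build a grounded
# prompt. Pure-Python stand-in for an embedding-based retriever.

def tokenize(text: str) -> set:
    return {w.strip(".,?").lower() for w in text.split()}

def top_k(query: str, passages: list, k: int = 2) -> list:
    """Rank passages by Jaccard overlap with the query tokens."""
    q = tokenize(query)
    return sorted(
        passages,
        key=lambda p: len(q & tokenize(p)) / len(q | tokenize(p)),
        reverse=True,
    )[:k]

PASSAGES = [
    "MatSciBERT is a materials domain language model for text mining.",
    "OceanGPT is a large language model for ocean science tasks.",
    "EnzChemRED is an enzyme chemistry relation extraction dataset.",
]

hits = top_k("Which language model targets materials text mining?", PASSAGES)
prompt = "Answer using only these passages:\n" + "\n".join(hits)
```

Constraining the model to the retrieved passages is what lets such tools attach citations to generated review text.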

Table 3  Application scenarios of using large language models for data mining in scientific literature

Type | Scenario | Core technologies | Typical cases
Supporting the construction of general-purpose and vertical LLMs | General-purpose LLMs | Self-supervised learning, instruction fine-tuning, transfer learning, RAG, reinforcement learning from human feedback, domain knowledge injection, prompt engineering | PubScholar[108] integrating sci-tech resources; ORKG[93] providing structured descriptions of scientific literature
 | Vertical domain LLMs | | Spark Research Assistant (星火科研助手)[6]; Web of Science Research Assistant; MatSciBERT[21], a language model for materials science text mining and information extraction; HuaTuo[88], a medical diagnosis model; OceanGPT[89], a large language model for ocean science; a brain science knowledge graph[91]; Euretos[92], a knowledge base for semantic integration of public life science data
Supporting the development of high-quality datasets | AI4S scientific literature databases | Active learning, RAG, prompt engineering, multi-agent collaboration, domain knowledge injection | EnzChemRED[96], an enzyme chemistry relation extraction dataset; Catalysis Hub[100], a catalysis science dataset; LLM4Mat-Bench[99], a materials science dataset
Supporting AI-driven scientific discovery | Hypothesis generation | In-context learning, chain-of-thought prompting, literature-based discovery, human-machine collaboration | SciAIEngine[78], an AI engine driven by scientific literature knowledge; ChatMOF[82], an AI system for predicting and generating metal-organic frameworks
 | Experimental validation | | MatPilot[112], an AI materials scientist
 | Decision support | | Euretos[92], a knowledge base for semantic integration of public life science data
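The "LLMs + expert validation" process that the abstract describes for ensuring dataset credibility can be sketched as a simple routing step: auto-accept high-confidence extractions and queue the rest for a domain expert. The threshold and record fields below are assumptions for illustration.

```python
# Hedged sketch of human-in-the-loop routing: records above a confidence
# threshold enter the dataset directly; the rest go to expert review.
# The 0.85 cutoff and the record schema are illustrative choices.

THRESHOLD = 0.85

def route(records: list) -> tuple:
    """Split extracted records into auto-accepted and expert-review queues."""
    accepted, review = [], []
    for rec in records:
        (accepted if rec["confidence"] >= THRESHOLD else review).append(rec)
    return accepted, review

records = [
    {"entity": "MOF-5", "confidence": 0.97},
    {"entity": "band gap 3.2 eV", "confidence": 0.62},
]
accepted, review = route(records)
```

Keeping the expert queue explicit is what makes the resulting datasets auditable, which the conclusions identify as a prerequisite for reliable AI4S use.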
1
王飞跃, 缪青海. 人工智能驱动的科学研究新范式: 从AI4S到智能科学[J]. 中国科学院院刊, 2023, 38(4): 536-540.
WANG F Y, MIAO Q H. Novel paradigm for AI-driven scientific research: From AI4S to intelligent science[J]. Bulletin of Chinese academy of sciences, 2023, 38(4): 536-540.
2
李国杰. 智能化科研(AI4R): 第五科研范式[J]. 中国科学院院刊, 2024, 39(1): 1-9.
LI G J. AI4R: The fifth scientific research paradigm[J]. Bulletin of Chinese academy of sciences, 2024, 39(1): 1-9.
3
罗威, 谭玉珊. 基于内容的科技文献大数据挖掘与应用[J]. 情报理论与实践, 2021, 44(6): 154-157.
LUO W, TAN Y S. Content-based data mining and application of scientific and technical literature big data[J]. Information studies: Theory & application, 2021, 44(6): 154-157.
4
熊泽润, 宋立荣. 科学数据出版中同行评议的问题思考[J]. 中国科技资源导刊, 2022, 54(5): 21-29.
XIONG Z R, SONG L R. Thinking about peer review in scientific data publishing[J]. China science & technology resources review, 2022, 54(5): 21-29.
5
代冰, 胡正银. 基于文献的知识发现新近研究综述[J]. 数据分析与知识发现, 2021, 5(4): 1-12.
DAI B, HU Z Y. Review of studies on literature-based discovery[J]. Data analysis and knowledge discovery, 2021, 5(4): 1-12.
6
钱力, 张智雄, 伍大勇, 等. 科技文献大模型: 方法、框架与应用[J]. 中国图书馆学报, 2024, 50(6): 45-58.
QIAN L, ZHANG Z X, WU D Y, et al. The large language model for scientific literature: Method, framework, and application[J]. Journal of library science in China, 2024, 50(6): 45-58.
7
支凤稳, 赵梦凡, 彭兆祺. 开放科学环境下科学数据与科技文献关联模式研究[J]. 数字图书馆论坛, 2023(10): 52-61.
ZHI F W, ZHAO M F, PENG Z Q. Relevance pattern of scientific data and scientific literature in open science environment[J]. Digital library forum, 2023(10): 52-61.
8
李泽宇, 刘伟. 基于大语言模型全流程微调的叙词表等级关系构建研究[J]. 情报理论与实践, 2025, 48(4): 152-162.
LI Z Y, LIU W. Research on the construction of hierarchical relationships in thesaurus based on the full-process fine-tuning of large language model[J]. Information studies: Theory & application, 2025, 48(4): 152-162.
9
曾建勋. “十四五”期间我国科技情报事业的发展思考[J]. 情报理论与实践, 2021, 44(1): 1-7.
ZENG J X. Reflection on the development of China's scientific and technical information industry during the "14th Five-Year Plan" period[J]. Information studies: Theory & application, 2021, 44(1): 1-7.
10
TSAI C W, LAI C F, CHAO H C, et al. Big data analytics: A survey[J]. Journal of big data, 2015, 2(1): 21.
11
赵冬晓, 王效岳, 白如江, 等. 面向情报研究的文本语义挖掘方法述评[J]. 现代图书情报技术, 2016(10): 13-24.
ZHAO D X, WANG X Y, BAI R J, et al. Semantic text mining methodologies for intelligence analysis[J]. New technology of library and information service, 2016(10): 13-24.
12
车万翔, 窦志成, 冯岩松, 等. 大模型时代的自然语言处理: 挑战、机遇与发展[J]. 中国科学: 信息科学, 2023, 53(9): 1645-1687.
CHE W X, DOU Z C, FENG Y S, et al. Towards a comprehensive understanding of the impact of large language models on natural language processing: Challenges, opportunities and future directions[J]. Scientia sinica (informationis), 2023, 53(9): 1645-1687.
13
张智雄, 于改红, 刘熠, 等. ChatGPT对文献情报工作的影响[J]. 数据分析与知识发现, 2023, 7(3): 36-42.
ZHANG Z X, YU G H, LIU Y, et al. The influence of ChatGPT on library & information services[J]. Data analysis and knowledge discovery, 2023, 7(3): 36-42.
14
刘熠, 张智雄, 王宇飞, 等. 基于语步识别的科技文献结构化自动综合工具构建[J]. 数据分析与知识发现, 2024, 8(2): 65-73.
LIU Y, ZHANG Z X, WANG Y F, et al. Constructing automatic structured synthesis tool for sci-tech literature based on move recognition[J]. Data analysis and knowledge discovery, 2024, 8(2): 65-73.
15
常志军, 钱力, 吴垚葶, 等. 面向主题场景的科技文献AI数据体系建设: 技术框架研究与实践[J]. 农业图书情报学报, 2024, 36(9): 4-17.
CHANG Z J, QIAN L, WU Y T, et al. Construction of a scientific literature AI data system for the thematic scenario: Technical framework research and practice[J]. Journal of library and information science in agriculture, 2024, 36(9): 4-17.
16
梁爽, 刘小平. 基于文本挖掘的科技文献主题演化研究进展[J]. 图书情报工作, 2022, 66(13): 138-149.
LIANG S, LIU X P. Research progress on topic evolution of scientific and technical literatures based on text mining[J]. Library and information service, 2022, 66(13): 138-149.
17
JIANG M. Very large language model as a unified methodology of text mining[J/OL]. arXiv preprint arXiv:2212.09271, 2022.
18
HUANG Q, SUN Y B, XING Z C, et al. API entity and relation joint extraction from text via dynamic prompt-tuned language model[J]. ACM transactions on software engineering and methodology, 2024, 33(1): 1-25.
19
GUPTA S, MAHMOOD A, SHETTY P, et al. Data extraction from polymer literature using large language models[J]. Communications materials, 2024, 5: 269.
20
KUMAR S, JAAFREH R, SINGH N, et al. Introducing MagBERT: A language model for magnesium textual data mining and analysis[J]. Journal of magnesium and alloys, 2024, 12(8): 3216-3228.
21
GUPTA T, ZAKI M, ANOOP KRISHNAN N M, et al. MatSciBERT: A materials domain language model for text mining and information extraction[J]. NPJ computational materials, 2022, 8: 102.
22
LIU Y F, LI S Y, DENG Y, et al. SSuieBERT: Domain adaptation model for Chinese space science text mining and information extraction[J]. Electronics, 2024, 13(15): 2949.
23
李盼飞, 杨小康, 白逸晨, 等. 基于大语言模型的中医医案命名实体抽取研究[J]. 中国中医药图书情报杂志, 2024, 48(2): 108-113.
LI P F, YANG X K, BAI Y C, et al. Study on named entity extraction in TCM medical records based on large language models[J]. Chinese journal of library and information science for traditional Chinese medicine, 2024, 48(2): 108-113.
24
杨冬菊, 黄俊涛. 基于大语言模型的中文科技文献标注方法[J]. 计算机工程, 2024, 50(9): 113-120.
YANG D J, HUANG J T. Chinese scientific literature annotation method based on large language model[J]. Computer engineering, 2024, 50(9): 113-120.
25
WEI X, CUI X Y, CHENG N, et al. ChatIE: Zero-shot information extraction via chatting with ChatGPT[J/OL]. arXiv preprint arXiv:2302.10205, 2023.
26
ZHENG Z L, ZHANG O F, BORGS C, et al. ChatGPT chemistry assistant for text mining and the prediction of MOF synthesis[J]. Journal of the American chemical society, 2023, 145(32): 18048-18062.
27
陆伟, 刘寅鹏, 石湘, 等. 大模型驱动的学术文本挖掘: 推理端指令策略构建及能力评测[J]. 情报学报, 2024, 43(8): 946-959.
LU W, LIU Y P, SHI X, et al. Large language model-driven academic text mining: Construction and evaluation of inference-end prompting strategy[J]. Journal of the China society for scientific and technical information, 2024, 43(8): 946-959.
28
杨金庆, 吴乐艳, 魏雨晗, 等. 科技文献新兴话题识别研究进展[J]. 情报学进展, 2020, 13(00): 202-234.
YANG J Q, WU L Y, WEI Y H, et al. Research progress on the identification of emerging topics in scientific and technological literature[J]. Advances in information science, 2020, 13(00): 202-234.
29
POLAK M P, MORGAN D. Extracting accurate materials data from research papers with conversational language models and prompt engineering[J]. Nature communications, 2024, 15: 1569.
30
XIE T, WAN Y W, HUANG W, et al. DARWIN series: Domain specific large language models for natural science[J/OL]. arXiv preprint arXiv:2308.13565, 2023.
31
杨帅, 刘建军, 金帆, 等. 人工智能与大数据在材料科学中的融合: 新范式与科学发现[J]. 科学通报, 2024, 69(32): 4730-4747.
YANG S, LIU J J, JIN F, et al. Integration of artificial intelligence and big data in materials science: New paradigms and scientific discoveries[J]. Chinese science bulletin, 2024, 69(32): 4730-4747.
32
SZYMANSKI N J, RENDY B, FEI Y X, et al. An autonomous laboratory for the accelerated synthesis of novel materials[J]. Nature, 2023, 624(7990): 86-91.
33
AI Q X, MENG F W, SHI J L, et al. Extracting structured data from organic synthesis procedures using a fine-tuned large language model[J]. Digital discovery, 2024, 3(9): 1822-1831.
34
ZHANG C H, LIN Q H, ZHU B W, et al. SynAsk: Unleashing the power of large language models in organic synthesis[J]. Chemical science, 2025, 16(1): 43-56.
35
GAO Y J, MYERS S, CHEN S, et al. When raw data prevails: Are large language model embeddings effective in numerical data representation for medical machine learning applications?[J/OL]. arXiv preprint arXiv:2408.11854, 2024.
36
DU Y, WANG L D, HUANG M Y, et al. Autodive: An integrated onsite scientific literature annotation tool[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations). Toronto, Canada. Stroudsburg, PA, USA: ACL, 2023: 76-85.
37
ZHANG Y, CHEN X S, JIN B W, et al. A comprehensive survey of scientific large language models and their applications in scientific discovery[J/OL]. arXiv preprint arXiv:2406.10833, 2024.
38
JETHANI N, JONES S, GENES N, et al. Evaluating ChatGPT in information extraction: A case study of extracting cognitive exam dates and scores[J/OL]. medRxiv, 2023.
39
JAMI H C, SINGH P R, KUMAR A, et al. CCU-llama: A knowledge extraction LLM for carbon capture and utilization by mining scientific literature data[J]. Industrial & engineering chemistry research, 2024, 63(41): 17585-17598.
40
Automating scientific knowledge extraction and modeling (ASKEM)[EB/OL]. [2025-01-14].
41
于丰畅, 程齐凯, 陆伟. 基于几何对象聚类的学术文献图表定位研究[J]. 数据分析与知识发现, 2021, 5(1): 140-149.
YU F C, CHENG Q K, LU W. Locating academic literature figures and tables with geometric object clustering[J]. Data analysis and knowledge discovery, 2021, 5(1): 140-149.
42
于丰畅, 陆伟. 一种学术文献图表位置标注数据集构建方法[J]. 数据分析与知识发现, 2020, 4(6): 35-42.
YU F C, LU W. Constructing data set for location annotations of academic literature figures and tables[J]. Data analysis and knowledge discovery, 2020, 4(6): 35-42.
43
MASSON D, MALACRIA S, VOGEL D, et al. ChartDetective: Easy and accurate interactive data extraction from complex vector charts[C]//Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. Hamburg Germany. ACM, 2023: 1-17.
44
ZHOU F F, ZHAO Y, CHEN W J, et al. Reverse-engineering bar charts using neural networks[J]. Journal of visualization, 2021, 24(2): 419-435.
45
黄梓航, 陈令羽, 蒋秉川. 基于文本解析的栅格类图表知识抽取方法[J]. 地理空间信息, 2023, 21(10): 23-27.
HUANG Z H, CHEN L Y, JIANG B C. Knowledge extraction method for raster chart based on text parsing[J]. Geospatial information, 2023, 21(10): 23-27.
46
LUO J Y, LI Z K, WANG J P, et al. ChartOCR: Data extraction from charts images via a deep hybrid framework[C]//2021 IEEE Winter Conference on Applications of Computer Vision (WACV). January 3-8, 2021, Waikoloa, HI, USA. IEEE, 2021: 1916-1924.
47
琚江舟, 毛云麟, 吴震, 等. 多粒度单元格对比的文本和表格数值问答模型[J/OL]. 软件学报, 2024: 1-21.
JU J Z, MAO Y L, WU Z, et al. Text and table numerical question answering model for multi-granularity cell comparison[J/OL]. Journal of software, 2024: 1-21.
48
容姿, 丁一, 李依泽, 等. 图表大数据解析方法综述[J]. 计算机辅助设计与图形学学报, 2025, 37(2): 216-228.
RONG Z, DING Y, LI Y Z, et al. Review of parsing methods for big data in chart[J]. Journal of computer-aided design & computer graphics, 2025, 37(2): 216-228.
49
WU A Y, WANG Y, SHU X H, et al. AI4VIS: Survey on artificial intelligence approaches for data visualization[J]. IEEE transactions on visualization and computer graphics, 2022, 28(12): 5049-5070.
50
MISHRA P, KUMAR S, CHAUBE M K. Evaginating scientific charts: Recovering direct and derived information encodings from chart images[J]. Journal of visualization, 2022, 25(2): 343-359.
51
ZHAO J Y, HUANG S, COLE J M. OpticalBERT and OpticalTable-SQA: Text- and table-based language models for the optical-materials domain[J]. Journal of chemical information and modeling, 2023, 63(7): 1961-1981.
52
黎颖, 吴清锋, 刘佳桐, 等. 引导性权重驱动的图表问答重定位关系网络[J]. 中国图象图形学报, 2023, 28(2): 510-521.
LI Y, WU Q F, LIU J T, et al. Leading weight-driven re-position relation network for figure question answering[J]. Journal of image and graphics, 2023, 28(2): 510-521.
53
LUO R, SASTIMOGLU Z, FAISAL A I, et al. Evaluating the efficacy of large language models for systematic review and meta-analysis screening[J/OL]. medRxiv, 2024.
54
WANG Y, GUO Q, YAO W, et al. AutoSurvey: Large language models can automatically write surveys[J]. Advances in neural information processing systems, 2024, 37: 115119-115145.
55
周莉. 生成式人工智能对学术期刊的变革与赋能研究[J]. 黄冈师范学院学报, 2024, 44(6): 57-60.
ZHOU L. The reform and empowerment of generative artificial intelligence to academic journals[J]. Journal of Huanggang normal university, 2024, 44(6): 57-60.
56
WANG S, SCELLS H, KOOPMAN B, et al. Can ChatGPT write a good Boolean query for systematic review literature search? [C]//Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2023: 1426-1436.
57
ANTU S A, CHEN H, RICHARDS C K. Using LLM (large language model) to improve efficiency in literature review for undergraduate research[J]. LLM@AIED, 2023: 8-16.
58
WU S C, MA X, LUO D H, et al. Automated review generation method based on large language models[J/OL]. arXiv preprint arXiv:2407.20906, 2024.
59
姜鹏, 任龑, 朱蓓琳. 大语言模型在分类标引工作中的应用探索[J]. 农业图书情报学报, 2024, 36(5): 32-42.
JIANG P, REN Y, ZHU B L. Exploration and practice of classification indexing combined with large language models[J]. Journal of library and information science in agriculture, 2024, 36(5): 32-42.
60
YAN X C, FENG S Y, YUAN J K, et al. SurveyForge: On the outline heuristics, memory-driven generation, and multi-dimensional evaluation for automated survey writing[J/OL]. arXiv preprint arXiv:2503.04629, 2025.
61
LUO Z M, YANG Z L, XU Z X, et al. LLM4SR: A survey on large language models for scientific research[J/OL]. arXiv preprint arXiv:2501.04306, 2025.
62
马畅, 田永红, 郑晓莉, 等. 基于知识蒸馏的神经机器翻译综述[J]. 计算机科学与探索, 2024, 18(7): 1725-1747.
MA C, TIAN Y H, ZHENG X L, et al. Survey of neural machine translation based on knowledge distillation[J]. Journal of frontiers of computer science and technology, 2024, 18(7): 1725-1747.
63
陈文杰, 胡正银, 石栖, 等. 融合知识图谱与大语言模型的科技文献复杂知识对象抽取研究[J/OL]. 现代情报, 2024: 1-20.
CHEN W J, HU Z Y, SHI X, et al. Research on scientific and technological literature complex knowledge object extraction fusing knowledge graph and large language model[J/OL]. Journal of modern information, 2024: 1-20.
64
KUMICHEV G, BLINOV P, KUZKINA Y, et al. MedSyn: LLM-based synthetic medical text generation framework[M]//Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. Cham: Springer Nature Switzerland, 2024: 215-230.
65
YANG D Y, MONAIKUL N, DING A, et al. Enhancing table representations with LLM-powered synthetic data generation[J/OL]. arXiv preprint arXiv:2411.03356, 2024.
66
ZHEZHERAU A, YANOCKIN A. Hybrid training approaches for LLMs: Leveraging real and synthetic data to enhance model performance in domain-specific applications[J/OL]. arXiv preprint arXiv:2410.09168, 2024.
67
GUO X, CHEN Y Q. Generative AI for synthetic data generation: Methods, challenges and the future[J/OL]. arXiv preprint arXiv:2403.04190, 2024.
68
LONG L, WANG R, XIAO R X, et al. On LLMs-driven synthetic data generation, curation, and evaluation: A survey[J/OL]. arXiv preprint arXiv:2406.15126, 2024.
69
KIM S, SUK J, YUE X, et al. Evaluating language models as synthetic data generators[J/OL]. arXiv preprint arXiv:2412.03679, 2024.
70
GOUGHERTY A V, CLIPP H L. Testing the reliability of an AI-based large language model to extract ecological information from the scientific literature[J]. NPJ biodiversity, 2024, 3: 13.
71
ZHANG J J, BAI Y S, LV X, et al. LongCite: Enabling LLMs to generate fine-grained citations in long-context QA[J/OL]. arXiv preprint arXiv:2409.02897, 2024.
72
ZHANG W, WANG Q G, KONG X T, et al. Fine-tuning large language models for chemical text mining[J]. Chemical science, 2024, 15(27): 10600-10611.
73
XIAO T, ZHU J B. Foundations of large language models[J/OL]. arXiv preprint arXiv:2501.09223, 2025.
74
MEDISETTI G, COMPSON Z, FAN H, et al. LitAI: Enhancing multimodal literature understanding and mining with generative AI[J]. Proceedings IEEE conference on multimedia information processing and retrieval, 2024, 2024: 471-476.
75
WEI H R, LIU C L, CHEN J Y, et al. General OCR theory: Towards OCR-2.0 via a unified end-to-end model[J/OL]. arXiv preprint arXiv:2409.01704, 2024.
76
POZNANSKI J, BORCHARDT J, DUNKELBERGER J, et al. olmOCR: Unlocking trillions of tokens in PDFs with vision language models[J/OL]. arXiv preprint arXiv:2502.18443, 2025.
77
SU A F, WANG A W, YE C, et al. TableGPT2: A large multimodal model with tabular data integration[J/OL]. arXiv preprint arXiv:2411.02059, 2024.
78
张智雄, 刘欢, 于改红. 构建基于科技文献知识的人工智能引擎[J]. 农业图书情报学报, 2021, 33(1): 17-31.
ZHANG Z X, LIU H, YU G H. Building an artificial intelligence engine based on scientific and technological literature knowledge[J]. Journal of library and information science in agriculture, 2021, 33(1): 17-31.
79
SU H, CHEN R, TANG S, et al. Two heads are better than one: A multi-agent system has the potential to improve scientific idea generation[J/OL]. arXiv preprint arXiv:2410.09403v2, 2024.
80
SCHMIDGALL S, SU Y S, WANG Z, et al. Agent laboratory: Using LLM agents as research assistants[J/OL]. arXiv preprint arXiv:2501.04227, 2025.
81
XI Z K, YIN W B, FANG J Z, et al. OmniThink: Expanding knowledge boundaries in machine writing through thinking[J/OL]. arXiv preprint arXiv:2501.09751, 2025.
82
KANG Y, KIM J. ChatMOF: An artificial intelligence system for predicting and generating metal-organic frameworks using large language models[J]. Nature communications, 2024, 15: 4705.
83
HAN S W, XIA P, ZHANG R Y, et al. MDocAgent: A multi-modal multi-agent framework for document understanding[J/OL]. arXiv preprint arXiv:2503.13964, 2025.
84
DENG C, YUAN J L, BU P, et al. LongDocURL: A comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating[J/OL]. arXiv preprint arXiv:2412.18424, 2024.
85
ZHA L Y, ZHOU J L, LI L Y, et al. TableGPT: Towards unifying tables, nature language and commands into one GPT[J/OL]. arXiv preprint arXiv:2307.08674, 2023.
86
王译婧, 徐海静. 人工智能助力多模态档案资源开发的实现路径[J]. 山西档案, 2025(4): 120-126, 137.
WANG Y J, XU H J. Implementation paths for AI-assisted development of multimodal archival resources[J]. Shanxi archives, 2025(4): 120-126, 137.
87
王飞跃, 王雨桐. 数字科学家与平行科学: AI4S和S4AI的本源与目标[J]. 中国科学院院刊, 2024, 39(1): 27-33.
WANG F Y, WANG Y T. Digital scientists and parallel sciences: The origin and goal of AI for science and science for AI[J]. Bulletin of Chinese academy of sciences, 2024, 39(1): 27-33.
88
WANG H C, LIU C, XI N W, et al. HuaTuo: Tuning LLaMA model with Chinese medical knowledge[J/OL]. arXiv preprint arXiv:2304.06975, 2023.
89
BI Z, ZHANG N Y, XUE Y D, et al. OceanGPT: A large language model for ocean science tasks[J/OL]. arXiv preprint arXiv:2310.02031, 2023.
90
鲜国建, 罗婷婷, 赵瑞雪, 等. 从人工密集型到计算密集型: NSTL数据库建设模式转型之路[J]. 数字图书馆论坛, 2020(7): 52-59.
XIAN G J, LUO T T, ZHAO R X, et al. Research and practice of the NSTL database construction mode transformation: From labor intensive to computing intensive[J]. Digital library forum, 2020(7): 52-59.
91
王婷, 何松泽, 杨川. 知识图谱相关方法在脑科学领域的应用综述[J]. 计算机技术与发展, 2022, 32(11): 1-7.
WANG T, HE S Z, YANG C. An application review of knowledge graph related methods in field of human brain science[J]. Computer technology and development, 2022, 32(11): 1-7.
92
MALAS T B, VLIETSTRA W J, KUDRIN R, et al. Drug prioritization using the semantic properties of a knowledge graph[J]. Scientific reports, 2019, 9: 6281.
93
JARADEH M Y, OELEN A, PRINZ M, et al. Open research knowledge graph: A system walkthrough[M]//Digital Libraries for Open Knowledge. Cham: Springer International Publishing, 2019: 348-351.
94
萧文科, 宋驰, 陈士林, 等. 中医药大语言模型的关键技术与构建策略[J]. 中草药, 2024, 55(17): 5747-5756.
XIAO W K, SONG C, CHEN S L, et al. Key technologies and construction strategies of large language models for traditional Chinese medicine[J]. Chinese traditional and herbal drugs, 2024, 55(17): 5747-5756.
95
SWAIN M C, COLE J M. ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature[J]. Journal of chemical information and modeling, 2016, 56(10): 1894-1904.
96
LAI P T, COUDERT E, AIMO L, et al. EnzChemRED, a rich enzyme chemistry relation extraction dataset[J]. Scientific data, 2024, 11: 982.
97
LIU Y, LIU D H, GE X Y, et al. A high-quality dataset construction method for text mining in materials science[J]. Acta physica sinica, 2023, 72(7): 070701.
98
ZHANG Y, WANG C, SOUKASEUM M, et al. Unleashing the power of knowledge extraction from scientific literature in catalysis[J]. Journal of chemical information and modeling, 2022, 62(14): 3316-3330.
99
RUBUNGO A N, LI K M, HATTRICK-SIMPERS J, et al. LLM4Mat-bench: Benchmarking large language models for materials property prediction[J/OL]. arXiv preprint arXiv:2411.00177, 2024.
100
TOSSTORFF A, RUDOLPH M G, COLE J C, et al. A high quality, industrial data set for binding affinity prediction: Performance comparison in different early drug discovery scenarios[J]. Journal of computer-aided molecular design, 2022, 36(10): 753-765.
101
孟小峰. 科学数据智能: 人工智能在科学发现中的机遇与挑战[J]. 中国科学基金, 2021, 35(3): 419-425.
MENG X F. Scientific data intelligence: AI for scientific discovery[J]. Bulletin of national natural science foundation of China, 2021, 35(3): 419-425.
102
高瑜蔚, 胡良霖, 朱艳华, 等. 国家基础学科公共科学数据中心建设与发展实践[J]. 科学通报, 2024, 69(24): 3578-3588.
GAO E G, HU L L, ZHU Y H, et al. Construction and practice of national basic science data center[J]. Chinese science bulletin, 2024, 69(24): 3578-3588.
103
邓仲华, 李志芳. 科学研究范式的演化: 大数据时代的科学研究第四范式[J]. 情报资料工作, 2013, 34(4): 19-23.
DENG Z H, LI Z F. The evolution of scientific research paradigm: The fourth paradigm of scientific research in the era of big data[J]. Information and documentation services, 2013, 34(4): 19-23.
104
包为民, 祁振强. 航天装备体系化仿真发展的思考[J]. 系统仿真学报, 2024, 36(6): 1257-1272.
BAO W M, QI Z Q. Thinking of aerospace equipment systematization simulation technology development[J]. Journal of system simulation, 2024, 36(6): 1257-1272.
105
李正风. 当代科学的新变化与科学学的新趋向[J]. 世界科学, 2024(8): 41-44.
LI Z F. New changes in contemporary science and new trends in science of science[J]. World science, 2024(8): 41-44.
106
HEY T, TANSLEY S, TOLLE K, eds. The Fourth Paradigm: Data-Intensive Scientific Discovery[M]. Redmond, WA: Microsoft Research, 2009.
107
余江, 张越, 周易. 人工智能驱动的科研新范式及学科应用研究[J]. 中国科学院院刊, 2025, 40(2): 362-370.
YU J, ZHANG Y, ZHOU Y. A new scientific research paradigm driven by AI and its applications in academic disciplines[J]. Bulletin of Chinese academy of sciences, 2025, 40(2): 362-370.
108
于改红, 谢靖, 张智雄, 等. 基于DIKIW的智能情报服务理论及系统框架研究与实践[J/OL]. 情报理论与实践, 2025: 1-11.
YU G H, XIE J, ZHANG Z X, et al. Research and practice of intelligent information service theory and system framework based on DIKIW[J/OL]. Information studies: Theory & application, 2025: 1-11.
109
张智雄. 在开放科学和AI时代塑造新型学术交流模式[J]. 中国科技期刊研究, 2024, 35(5): 561-567.
ZHANG Z X. Shaping new models of scholarly communication in the era of open science and AI[J]. Chinese journal of scientific and technical periodicals, 2024, 35(5): 561-567.
110
钱力, 刘细文, 张智雄, 等. AI+智慧知识服务生态体系研究设计与应用实践: 以中国科学院文献情报中心智慧服务平台建设为例[J]. 图书情报工作, 2021, 65(15): 78-90.
QIAN L, LIU X W, ZHANG Z X, et al. Design and application of ecological system of intelligent knowledge service based on AI: An example of building of intelligent service platform of national science library, CAS[J]. Library and information service, 2021, 65(15): 78-90.
111
AMMAR W, GROENEVELD D, BHAGAVATULA C, et al. Construction of the literature graph in semantic scholar[J/OL]. arXiv preprint arXiv:1805.02262, 2018.
112
NI Z Q, LI Y H, HU K J, et al. MatPilot: An LLM-enabled AI materials scientist under the framework of human-machine collaboration[J/OL]. arXiv preprint arXiv:2411.08063, 2024.
113
WANG T R, HU J Y, OUYANG R H, et al. Nature of metal-support interaction for metal catalysts on oxide supports[J]. Science, 2024, 386(6724): 915-920.
114
FÉBBA D, EGBO K, CALLAHAN W A, et al. From text to test: AI-generated control software for materials science instruments[J]. Digital discovery, 2025, 4(1): 35-45.
115
周力虹. 面向驱动AI4S的科学数据聚合: 需求、挑战与实现路径[J]. 农业图书情报学报, 2023, 35(10): 13-15.
ZHOU L H. Scientific data aggregation for driving AI4S: Requirements, challenges and implementation paths[J]. Journal of library and information science in agriculture, 2023, 35(10): 13-15.
116
叶悦. AI大模型时代出版内容数据保护的理据与进路[J]. 出版与印刷, 2025(1): 27-36.
YE Y. The rationale and approach for data protection of published contents in the era of AI big models[J]. Publishing & printing, 2025(1): 27-36.
117
QU Y Y, DING M, SUN N, et al. The frontier of data erasure: Machine unlearning for large language models[J/OL]. arXiv preprint arXiv:2403.15779, 2024.