Journal of Library and Information Science in Agriculture ›› 2025, Vol. 37 ›› Issue (9): 63-81. doi: 10.13998/j.cnki.issn1002-1248.25-0513

• Research Paper •

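Research on a "Science-Technology" Topic Linkage Method Based on Large Language Model Data Augmentation: The Case of the Energy Conservation Field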

Wang Xiaoyu1, Hu Jingyuan1, Wu Ruoyu1, Wang Shu2, Zhai Yujia3

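  1. School of Management Science and Engineering, Dongbei University of Finance and Economics, Dalian 116025
    2. Library, Dalian University of Technology, Dalian 116024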
    3. School of Management, Tianjin Normal University, Tianjin 300387
  • Received: 2025-09-23 Issue date: 2025-09-05 Release date: 2025-12-08
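  • About the authors:

    Wang Xiaoyu (1989- ), female, PhD, lecturer; research interests: machine learning and knowledge discovery

    Hu Jingyuan (2000- ), female, master's student; research interest: knowledge discovery

    Wu Ruoyu (2002- ), female, undergraduate; research interest: machine learning

    Wang Shu (1985- ), female, master's degree, librarian; research interests: patent analysis and data visualization

    Zhai Yujia (1988- ), male, PhD, associate professor; research interest: AI-driven knowledge discovery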

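  • Funding:
    Liaoning Provincial Social Science Planning Fund Youth Project "Research on Patent Examination Decision-Making Mechanisms and Efficiency Improvement under Human-Machine Collaboration" (L24CTQ003)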

An LLM-based Data Augmentation Method for Constructing Science & Technology Topic Linkages: Taking the Energy Conservation Field as an Example

WANG Xiaoyu1, HU Jingyuan1, WU Ruoyu1, WANG Shu2, ZHAI Yujia3   

  1. Dongbei University of Finance and Economics, Dalian 116025
    2. Dalian University of Technology, Dalian 116024
    3. Management School of Tianjin Normal University, Tianjin 300387
  • Received: 2025-09-23 Online: 2025-09-05 Published: 2025-12-08

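Abstract (translated from the Chinese):

[Purpose/Significance] To construct science-technology linkages at topic granularity, this study proposes a data augmentation method based on large language models (LLMs) for mining latent associations between paper topics and patent topics. [Method/Process] Treating LLMs as a knowledge base that bridges science and technology, the study uses ChatGPT-4 to infer synonymous variants for papers and patents in the energy conservation field, and validates the method by applying the augmented text features to a non-patent citation prediction task. [Results/Conclusions] Experiments with four baseline models show that relevance features extracted from the augmented texts improve the accuracy of non-patent citation prediction, with AUC gains of 13.91%, 16.90%, 16.21%, and 15.69%. The method can bridge the semantic gap between paper and patent texts, and it offers new ideas and a technical reference for term alignment across non-homologous texts and for mining latent semantic associations between S&T topics. Because the study is validated in a single field only, future work needs to explore the method's effectiveness and applicability boundaries across multiple fields.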
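The idea of using synonymous variants to close the vocabulary gap between a paper and a patent can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words cosine similarity, the helper names (`tokenize`, `cosine`, `augmented_vector`), and the toy texts are all assumptions, and the variants are hand-written stand-ins for what an LLM might return.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def augmented_vector(original: str, variants: list[str]) -> Counter:
    """Merge the original text with its synonym variants into one vector."""
    merged = tokenize(original)
    for variant in variants:
        merged += tokenize(variant)
    return merged


# Toy texts (hypothetical); the variants stand in for LLM-generated output.
paper = "waste heat recovery improves thermal efficiency"
patent = "apparatus for reclaiming exhaust energy"
paper_variants = ["reclaiming exhaust heat energy raises efficiency"]
patent_variants = ["device for waste heat recovery"]

# Relevance feature without and with augmentation.
baseline = cosine(tokenize(paper), tokenize(patent))
augmented = cosine(
    augmented_vector(paper, paper_variants),
    augmented_vector(patent, patent_variants),
)
print(f"baseline={baseline:.3f} augmented={augmented:.3f}")
```

In this toy case the original texts share no tokens, so the lexical feature carries no signal, while the augmented texts overlap on the shared concept. A real pipeline would likely use a stronger text representation than raw token counts.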

Keywords: large language models, data augmentation, science-technology linkage, topic similarity

Abstract:

[Purpose/Significance] Understanding how scientific research links to technological innovation is critical for guiding strategic decision-making, optimizing resource allocation, and promoting effective technology transfer. Scientific publications and patents are complementary yet heterogeneous knowledge sources: their distinct linguistic styles, terminologies, and document structures often create a significant semantic gap. Traditional methods of linking scientific and technological (S&T) knowledge rely primarily on lexical overlap, keyword co-occurrence, or citation analysis, and they are limited in their ability to capture deeper semantic relationships, particularly across non-homologous texts. To address this challenge, this study proposes an approach that uses large language models (LLMs) for data augmentation in order to uncover latent semantic associations between research paper topics and patent topics. The key innovation lies in using LLMs not merely for text generation but as a semantic bridge that enhances cross-domain knowledge alignment, thereby extending the methodological toolkit for science-technology linkage studies. The approach may contribute to knowledge mapping, thematic analysis, and strategic innovation management, particularly in areas where domain-specific terminology or conceptual divergence hampers conventional analyses.

[Method/Process] The proposed method employs ChatGPT-4 as a knowledge-enriched intermediary to generate semantically enhanced textual variants of existing S&T documents in the energy conservation field. Specifically, the LLM performs synonym-based paraphrasing, expansion, and semantic inference on research paper abstracts and patent summaries, producing augmented texts that retain domain relevance while highlighting latent semantic connections. Features extracted from these augmented texts are then incorporated into a non-patent citation prediction task, which serves as a practical evaluation of the method's effectiveness. By comparing predicted associations against existing citation links, the study assesses how well LLM-derived features capture cross-domain topic relatedness beyond lexical similarity. The approach rests on the premise that LLMs can model high-level semantic patterns, which enables the inference of conceptual correspondence even when scientific and technological texts use different explicit terminology.

[Results/Conclusions] Experiments with four baseline models show that features derived from the augmented texts consistently improved prediction performance. The area under the ROC curve (AUC) increased by 13.91%, 16.90%, 16.21%, and 15.69% across the four models, respectively, which indicates that LLM-based data augmentation helps bridge the semantic gap between S&T knowledge. These results suggest that the method can uncover latent topic associations, facilitate cross-domain term alignment, and support knowledge discovery tasks that conventional lexical approaches may overlook. However, the study is limited to a single application domain, which leaves open its generalizability across other S&T fields. Future work should extend the methodology to diverse domains, investigate the robustness of the LLM-generated semantic bridges, and explore automated mechanisms for scaling cross-domain knowledge integration. Overall, this research provides a promising framework for enhancing the semantic connectivity of heterogeneous knowledge sources, contributing to a broader understanding of the interactions between science and technology and informing data-driven strategies for managing research and innovation.
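The AUC comparison described above can be sketched as follows. This is an illustrative computation, not the study's evaluation code: the `auc` helper, the labels, and both score lists are hypothetical toy values, and the abstract does not state whether the reported gains are absolute or relative, so the sketch prints both.

```python
def auc(labels: list[int], scores: list[float]) -> float:
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical): 1 = the patent cites the paper, 0 = it does not.
labels = [1, 1, 1, 0, 0, 0]
baseline_scores = [0.9, 0.4, 0.3, 0.8, 0.35, 0.1]   # model without augmented features
augmented_scores = [0.9, 0.7, 0.6, 0.65, 0.3, 0.1]  # model with augmented features

auc_base = auc(labels, baseline_scores)
auc_aug = auc(labels, augmented_scores)
print(f"baseline AUC={auc_base:.3f}, augmented AUC={auc_aug:.3f}")
print(f"absolute gain={auc_aug - auc_base:.3f}")
print(f"relative gain={(auc_aug - auc_base) / auc_base:.1%}")
```

The pairwise formulation is quadratic in the number of examples, which is fine for illustration; a rank-based implementation would be preferable for large candidate sets.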

Key words: large language model, data augmentation, S&T linkage, topic similarity

CLC number: G350, TP391

Cite this article

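Wang Xiaoyu, Hu Jingyuan, Wu Ruoyu, Wang Shu, Zhai Yujia. Research on a "Science-Technology" Topic Linkage Method Based on Large Language Model Data Augmentation: The Case of the Energy Conservation Field[J]. Journal of Library and Information Science in Agriculture, 2025, 37(9): 63-81.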

WANG Xiaoyu, HU Jingyuan, WU Ruoyu, WANG Shu, ZHAI Yujia. An LLM-based Data Augmentation Method for Constructing Science & Technology Topic Linkages: Taking the Energy Conservation Field as an Example[J]. Journal of Library and Information Science in Agriculture, 2025, 37(9): 63-81.