Agricultural Library and Information (农业图书情报) ›› 2019, Vol. 31 ›› Issue (4): 19-28. doi: 10.13998/j.cnki.issn1002-1248.2019.03.19-0342

• Research Paper •

Research of Topics Discovery and Tech Evolution Based on Text Preprocessed LDA Model

WANG Li (王丽)1,2, SHEN Xiang (沈湘)1,2

  1. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
  2. Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2019-04-24 Online: 2019-04-05 Published: 2019-06-21
  • Corresponding author: WANG Li
  • About the authors: WANG Li (born 1982), female, ORCID: 0000-0002-9513-6159, associate research librarian at the National Science Library, Chinese Academy of Sciences; research interests: information analysis, text mining, recommender systems, natural language processing. SHEN Xiang (born 1985), female, ORCID: 0000-0002-0682-7530, assistant researcher at the National Science Library, Chinese Academy of Sciences; research interests: subject intelligence and strategic intelligence research.
  • Funding:
    NSTL fund project "Thematic Intelligence Services for the National Key R&D Program" (Project No. 2018XM06)

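Abstract (Chinese version): [Objective] As scientific and technical intelligence resources grow rapidly, discovering research topics quickly through the analysis of large text collections, and then mining how technology develops and changes within each topic, is important for intelligence work that must respond comprehensively and quickly. [Methods] For large text data, topic discovery and technology evolution analysis with an LDA model applied after text preprocessing were implemented in Python. First, a generalized text preprocessing model is built to identify technical terms automatically. Next, an LDA model is built on those technical terms and visualized to identify research topics. Finally, a computational model of technology evolution is built on the technical terms to further mine the development and change of technologies. [Results] The method was applied to 43,621 patents in the SiC technology field, covering the full workflow of text preprocessing, topic discovery and visualization, and analysis of technology development and change within a given topic. The workflow ran smoothly and quickly (the whole case study took about 10 minutes). [Limitations] In the proposed model of technology evolution within each LDA topic, a document is associated only with its most relevant topic. Evolution results under multi-topic association of documents have not yet been compared, and further optimization and validation are needed. [Conclusions] The proposed method helps in grasping a scientific or technical field quickly and comprehensively. By identifying topics and the technological changes within them, a field can be studied at different granularities, and valuable clues are provided for follow-up investigation and analysis.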
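The single-topic association described under the limitations can be illustrated with a minimal Python sketch. It assumes each patent already has a topic probability vector from a fitted LDA model; the field names (`id`, `year`, `topic_probs`), the helper functions, and the toy values are hypothetical and are not taken from the article.

```python
from collections import defaultdict


def dominant_topic(topic_probs):
    """Return the index of the topic with the highest probability.

    On ties, the lowest topic index wins (behaviour of max with a key).
    """
    return max(range(len(topic_probs)), key=lambda k: topic_probs[k])


def group_by_topic_and_year(docs):
    """Map (topic, year) -> list of document ids.

    Each document is linked only to its dominant topic, so any weight it
    carries on other topics is discarded.
    """
    groups = defaultdict(list)
    for doc in docs:
        topic = dominant_topic(doc["topic_probs"])
        groups[(topic, doc["year"])].append(doc["id"])
    return dict(groups)


# Hypothetical toy data: four patents, three topics.
docs = [
    {"id": "p1", "year": 2016, "topic_probs": [0.7, 0.2, 0.1]},
    {"id": "p2", "year": 2016, "topic_probs": [0.1, 0.8, 0.1]},
    {"id": "p3", "year": 2017, "topic_probs": [0.6, 0.3, 0.1]},
    {"id": "p4", "year": 2017, "topic_probs": [0.45, 0.45, 0.1]},
]

groups = group_by_topic_and_year(docs)
for key in sorted(groups):
    print(key, groups[key])
```

Patent `p4` splits its weight evenly between topics 0 and 1 but is counted only under topic 0, which is the kind of case a multi-topic comparison would need to examine.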

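Keywords (Chinese version): LDA model, technology evolution, text preprocessing, visualization, automatic identification of technical terms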

Abstract: [Objective] Computational science and data science are driving intelligent analysis and information services. Machine learning methods for text analysis are changing traditional analysis methods. This article discusses the benefits of unsupervised learning approaches in patent text mining. [Methods] Patent data from the SiC industry were preprocessed with a filter model based on the NLTK toolkit to identify technical terms, then clustered with a Latent Dirichlet Allocation (LDA) model to find latent topics, which were visualized. Using group operations, the top terms ranked by tf-idf in each year were used to reveal the evolution of R&D focus. [Results] The proposed method is demonstrated on 43,621 SiC patents. The results show 28 R&D topics with their technical terms in the SiC industry and present an evolution of R&D focus based on the newly emerging terms of each year, which provides clues for more detailed analysis later. Finally, we discuss the clues for the R&D focus in the SiC industry. [Limitations] Assigning multiple topics to each document was not compared for R&D focus evolution in this article; this will be discussed in future work. [Conclusions] The results show an efficient way to find technology focus evolution in large-scale text data.
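The yearly ranking step can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: each patent is treated as one document, idf is computed over the whole corpus with a `+1` smoothing term, tf-idf scores are summed per year, and a term counts as "emerging" in a year if it is in that year's top-k list but in no earlier year's list. The function names and toy terms are hypothetical.

```python
import math
from collections import Counter, defaultdict


def top_terms_by_year(docs, k=2):
    """docs: list of (year, [terms]). Return {year: top-k terms by summed tf-idf}."""
    n_docs = len(docs)
    df = Counter()
    for _, terms in docs:
        df.update(set(terms))  # document frequency of each term
    # Assumed idf variant: log(N / df) + 1, so ubiquitous terms still score > 0.
    idf = {t: math.log(n_docs / df[t]) + 1.0 for t in df}

    scores = defaultdict(Counter)
    for year, terms in docs:
        for term, count in Counter(terms).items():
            scores[year][term] += (count / len(terms)) * idf[term]

    # Sort by descending score, breaking ties alphabetically.
    return {
        year: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for year, s in sorted(scores.items())
    }


def emerging_terms(top_by_year):
    """Keep, for each year, only top terms not seen in any earlier year's top list."""
    seen, result = set(), {}
    for year in sorted(top_by_year):
        result[year] = [t for t in top_by_year[year] if t not in seen]
        seen.update(top_by_year[year])
    return result


# Hypothetical toy corpus: one patent per year, already reduced to technical terms.
docs = [
    (2016, ["epitaxy", "epitaxy", "wafer"]),
    (2017, ["epitaxy", "mosfet", "mosfet"]),
    (2018, ["mosfet", "trench", "trench"]),
]

top_by_year = top_terms_by_year(docs, k=2)
emerging = emerging_terms(top_by_year)
print(top_by_year)
print(emerging)
```

In this toy corpus, `mosfet` first enters the top list in 2017 and `trench` in 2018, so they are reported as the emerging terms of those years, while `epitaxy` in 2017 is filtered out because it already ranked in 2016.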

Key words: LDA model, tech evolution, preprocessed text, visualization, automatic term identification

CLC number:

  • TP393

Cite this article

王丽, 沈湘. 文本预处理后的LDA模型主题发现与技术演进研究[J]. 农业图书情报, 2019, 31(4): 19-28.

WANG Li, SHEN Xiang. Research of Topics Discovery and Tech Evolution Based on Text Preprocessed LDA Model[J]. Agricultural Library and Information, 2019, 31(4): 19-28.