Agricultural Library and Information ›› 2019, Vol. 31 ›› Issue (6): 21-30. doi: 10.13998/j.cnki.issn1002-1248.2019.06.19-0550

• Research Paper •

Topical Analysis of Scientific and Technical Reports based on Topical N-Grams Model

AN Xin1, XU Shuo2

  1. School of Economics and Management, Beijing Forestry University, Beijing 100083, China;
  2. Research Base of Beijing Modern Manufacturing Development, College of Economics and Management, Beijing University of Technology, Beijing 100124, China
  • Received: 2019-06-24 Online: 2019-06-05 Published: 2019-08-02
  • Corresponding author: XU Shuo (1979-), male, ORCID: 0000-0002-8602-1819, PhD, professor, Beijing University of Technology; research interests: detection of research fronts, technology foresight, big data, and data mining, among others; E-mail: xushuo@bjut.edu.cn.
  • About the author: AN Xin (1980-), female, PhD, associate professor, Beijing Forestry University; research interests: scientometrics, knowledge discovery, and knowledge organization and management, among others; E-mail: anxin@bjfu.edu.cn.
  • Funding:
    Natural Science Foundation of Guangdong Province project "Research on the Methodology and Model Construction for Anticipating Frontier Technologies in the Biomedical Field" (Grant No. 2018A030313695)

Abstract: As one of the important carriers of scientific and technical (S&T) intelligence, S&T reports can reflect the course of S&T development, reveal developments at the S&T frontier, and even offer insight into the trends of S&T development. Research on the development and utilization of S&T reports in China currently focuses mainly on the publication and distribution of S&T reports in book form or as electronic publications, database construction, service modes, and intellectual property, while deep data mining of S&T reports remains largely under-studied. This work attempts to discover the latent domain topics of S&T reports with the topical n-grams model. To determine the number of topics in the S&T reports of a specific domain, a method for computing the perplexity of the topical n-grams model, based on dynamic programming, is proposed. Finally, 70 domain topics are discovered from 1,344 S&T reports in the tumor domain, such as "molecular mechanisms/tumor cells" and "system biology/key methods". Experimental results show that it is feasible and efficient to discover latent topics from S&T reports with the topical n-grams model.

Key words: scientific and technical reports, topical n-grams model, topical analysis, perplexity, heat map
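
The abstract uses perplexity to choose the number of topics. As a rough illustration only, the sketch below applies the standard definition of perplexity (the exponential of the negative mean per-token log-likelihood) and picks the candidate topic count with the lowest value. It is not the authors' dynamic-programming method for the topical n-grams model, which this page does not describe; the function names and the log-likelihood values are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Standard perplexity: exp(-(1/N) * sum of per-token log-likelihoods)."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / n)


def pick_num_topics(log_probs_by_k):
    """Return the candidate topic count with the lowest perplexity."""
    return min(log_probs_by_k, key=lambda k: perplexity(log_probs_by_k[k]))


# Hypothetical held-out per-token log-likelihoods for three candidate
# topic counts; a real run would obtain these from a trained model.
candidates = {
    10: [math.log(0.010)] * 4,
    20: [math.log(0.020)] * 4,
    30: [math.log(0.015)] * 4,
}
best_k = pick_num_topics(candidates)
```

Lower perplexity means the model assigns higher probability to the held-out text, so under these made-up values the middle candidate would be selected.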

CLC Number: 

  • G322

Cite this article

AN Xin, XU Shuo. Topical Analysis of Scientific and Technical Reports based on Topical N-Grams Model[J]. Agricultural Library and Information, 2019, 31(6): 21-30.