Journal of Library and Information Science in Agriculture ›› 2021, Vol. 33 ›› Issue (9): 93-103. doi: 10.13998/j.cnki.issn1002-1248.21-0237

• Research Paper •

基于PDF版式特征的文献篇章结构细粒度抽取方法研究

赵婉婧1,2, 刘敏娟1,2,*, 刘洪冰1,2, 王新1,2, 段飞虎3   

  1. 中国农业科学院农业信息研究所,北京 100081;
    2.农业农村部 农业大数据重点实验室,北京 100081;
    3.同方知网数字出版技术股份有限公司,北京 100192
  • 收稿日期:2021-04-01 出版日期:2021-09-05 发布日期:2021-09-28
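  • Corresponding author: * LIU Minjuan (ORCID: 0000-0001-8422-2919), female, PhD, associate research fellow, director of the Resource Development Department; research interests: information resource development. Email: liuminjuan@caas.cn
  • About the authors: ZHAO Wanjing (ORCID: 0000-0001-7345-0895), female, master's degree, assistant research fellow; research interests: information resource development and data processing. LIU Hongbing (born 1990), female, master's degree, assistant research fellow; research interests: information resource development. WANG Xin (born 1986), male, PhD, librarian; research interests: digital resource management and information organization. DUAN Feihu (born 1983), male, master's degree, senior engineer; research interests: big data, natural language
  • Funding:
    Science and Technology Innovation Project of the Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII)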

A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features

ZHAO Wanjing1,2, LIU Minjuan1,2,*, LIU Hongbing1,2, WANG Xin1,2, DUAN Feihu3   

  1. Agricultural Information Institute of CAAS, Beijing 100081;
    2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081;
    3. Tongfang Knowledge Network Digital Publishing Technology Co., Ltd., Beijing 100192
  • Received: 2021-04-01 Online: 2021-09-05 Published: 2021-09-28

摘要: [目的/意义]为实现文献资源的细粒度组织,满足用户日趋精准的信息服务需求,研究提出一种基于PDF版式特征的文献篇章结构细粒度自动抽取方法。[方法/过程]方法充分利用机器学习在信息分类方面的优势,针对非结构化的PDF文档,基于其版式特征对章节标题进行自动分析、识别与抽取。根据章节标题的坐标定位,将正文内容以段落为最小颗粒度自动匹配至所属标题的下级位置,最终实现文档全文结构的细粒度抽取和重组。[结果/结论]经实测,机器自动抽取平均正确率达80%,针对非结构化PDF文档的细粒度抽取工作具有较好的现实意义和应用前景,基于底层方法设计构建的数据处理系统现已投入实际应用,大幅解放人工进行篇章结构细粒度抽取的工作。

关键词: 版式特征, 篇章结构, 章节标题, 细粒度抽取, 机器学习

Abstract: [Purpose/Significance] To organize literature resources at a fine granularity and meet users' growing demand for precise information services, this paper proposes a method for the automatic fine-grained extraction of document chapter structure based on PDF layout features. [Method/Process] The method draws on the strengths of machine learning in classification: for unstructured PDF documents, it automatically analyzes, identifies and extracts chapter titles from their layout features. Using the coordinates of each chapter title, it then assigns the body text, with the paragraph as the smallest unit, to the position beneath the title it belongs to, which yields a fine-grained extraction and reorganization of the document's full-text structure. [Results/Conclusions] In tests, automatic extraction reached an average accuracy of 80%. Fine-grained extraction from unstructured PDF documents therefore has practical value and good application prospects. A data processing system built on the method is already in practical use and substantially reduces the manual work of extracting chapter structure.

Key words: layout features, chapter structure, chapter title, fine-grained extraction, machine learning
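
The second step described in the abstract, matching body paragraphs to the chapter title they belong to by position, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Block` and `Section` types, the `level` field (assumed here to be the output of the title classifier), and a single-column reading order sorted by page and vertical position are all hypothetical choices made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One text block taken from a PDF page (hypothetical schema)."""
    page: int       # 0-based page index
    top: float      # distance from the top edge of the page
    text: str
    level: int = 0  # 0 = body paragraph, 1 = chapter title, 2 = sub-title, ...


@dataclass
class Section:
    """A title together with the paragraphs and sub-sections beneath it."""
    title: str
    level: int
    paragraphs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def build_outline(blocks):
    """Attach each paragraph to the nearest preceding title in reading order."""
    root = Section(title="(front matter)", level=0)
    stack = [root]  # path from the root to the most recent title
    # Assumption: single-column layout, so (page, top) gives the reading order.
    for block in sorted(blocks, key=lambda b: (b.page, b.top)):
        if block.level == 0:
            stack[-1].paragraphs.append(block.text)
            continue
        # Close every open section at the same or a deeper level than the new title.
        while stack[-1].level >= block.level:
            stack.pop()
        section = Section(title=block.text, level=block.level)
        stack[-1].children.append(section)
        stack.append(section)
    return root


# Made-up sample blocks, deliberately listed out of reading order.
blocks = [
    Block(0, 300.0, "Layout features include font size and position."),
    Block(0, 100.0, "1 Introduction", level=1),
    Block(0, 150.0, "PDF files are unstructured."),
    Block(0, 250.0, "1.1 Features", level=2),
    Block(1, 50.0, "2 Method", level=1),
    Block(1, 90.0, "Paragraphs follow their title."),
]
outline = build_outline(blocks)
```

Keeping a stack of open sections means each paragraph lands under the deepest title that is still open, which corresponds to using the paragraph as the smallest unit of the reorganized structure. Multi-column pages would need a different reading-order key.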

CLC Number:

  • G250

Cite this article

赵婉婧, 刘敏娟, 刘洪冰, 王新, 段飞虎. 基于PDF版式特征的文献篇章结构细粒度抽取方法研究[J]. 农业图书情报学报, 2021, 33(9): 93-103.

ZHAO Wanjing, LIU Minjuan, LIU Hongbing, WANG Xin, DUAN Feihu. A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features[J]. Journal of Library and Information Science in Agriculture, 2021, 33(9): 93-103.