Journal of Library and Information Science in Agriculture, 2021, Vol. 33, Issue 9: 93-103. doi: 10.13998/j.cnki.issn1002-1248.21-0237

A Fine-grained Extraction Method of Chapter Structure of Documents Based on PDF Layout Features

ZHAO Wanjing1,2, LIU Minjuan1,2,*, LIU Hongbing1,2, WANG Xin1,2, DUAN Feihu3   

  1. Agricultural Information Institute of CAAS, Beijing 100081;
  2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081;
  3. Tongfang Knowledge Network Digital Publishing Technology Co., Ltd., Beijing 100192
  • Received: 2021-04-01 Online: 2021-09-05 Published: 2021-09-28

Abstract: [Purpose/Significance] This paper proposes a fine-grained method for automatically extracting document structure from PDF layout features, with the aim of organizing literature resources at a fine granularity and meeting users' growing demand for precise information services. [Method/Process] The method uses machine-learning classification to automatically analyze, identify, and extract the chapter titles of unstructured PDF documents from their layout features. Using the coordinates of the chapter titles, it then assigns the body content to the title it falls under, with the paragraph as the minimum unit, which yields a fine-grained extraction and labeling of the document's full text. [Results/Conclusions] Tests show that the average accuracy of automatic extraction can reach 80%. The proposed method has practical value and application prospects: a data processing system built on it has been put into practical use, greatly reducing the repetitive manual work of extracting chapter structure.
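The two steps named in the abstract (identify chapter titles from layout features, then attach body paragraphs to titles by position) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper trains a machine-learning classifier, whereas the `is_title` rule below is a hand-written stand-in, and the `Block` fields, the 60-character limit, and the numbering pattern are all assumptions made for illustration. The text blocks are assumed to have already been extracted from the PDF with their coordinates and font attributes.

```python
import re
from dataclasses import dataclass, field
from statistics import mode


@dataclass
class Block:
    """One text block with hypothetical layout features."""
    page: int         # 0-based page index
    top: float        # distance from the top edge of the page
    font_size: float
    bold: bool
    text: str


@dataclass
class Section:
    title: str
    paragraphs: list = field(default_factory=list)


# Assumed title numbering such as "1 Introduction" or "2.1 Data"
NUMBERED = re.compile(r"^\d+(\.\d+)*\s+\S")


def is_title(block, body_size):
    # Hand-written rule standing in for the trained classifier:
    # a short line that is larger than body text, or bold and numbered.
    short = len(block.text) <= 60
    larger = block.font_size > body_size
    numbered = bool(NUMBERED.match(block.text))
    return short and (larger or (block.bold and numbered))


def extract_sections(blocks):
    # Treat the most common font size as the body text size.
    body_size = mode([b.font_size for b in blocks])
    # Reading order: by page, then top to bottom (single column assumed).
    ordered = sorted(blocks, key=lambda b: (b.page, b.top))
    sections = [Section(title="(front matter)")]
    for b in ordered:
        if is_title(b, body_size):
            sections.append(Section(title=b.text))
        else:
            # A paragraph belongs to the nearest preceding title.
            sections[-1].paragraphs.append(b.text)
    return sections


demo = [
    Block(1, 70.0, 10.5, False, "Layout features are computed per line."),
    Block(0, 50.0, 16.0, True, "1 Introduction"),
    Block(0, 120.0, 10.5, False, "Second paragraph of the introduction."),
    Block(1, 40.0, 16.0, True, "2 Method"),
    Block(0, 80.0, 10.5, False, "First paragraph of the introduction."),
]
sections = extract_sections(demo)
for s in sections:
    print(s.title, len(s.paragraphs))
```

Sorting by page and vertical position is what lets the paragraph, rather than the page or the whole document, serve as the minimum unit of assignment. A multi-column layout would need a column index in the sort key as well.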

Key words: layout features, chapter structure, chapter title, fine-grained extraction, machine learning

CLC Number: G250