
Journal of Library and Information Science in Agriculture ›› 2024, Vol. 36 ›› Issue (5): 32-42. doi: 10.13998/j.cnki.issn1002-1248.24-0346


Exploration and Practice of Classification Indexing Combined with Large Language Models

JIANG Peng, REN Yan, ZHU Beiling   

  1. Shanghai Library, Shanghai 200030
  • Received: 2024-04-13 Published: 2024-09-24

Abstract: [Purpose/Significance] Document classification is one of the fundamental tasks of information service institutions such as libraries. Limited human resources make it difficult to categorize the vast number of documents, and current automatic indexing technologies are not yet fully integrated into the indexing process as a whole. Large language models (LLMs), with their strong natural language understanding and processing capabilities, have been applied to natural language processing tasks such as text generation, automatic summarization, and text classification, and can be integrated into the entire classification process. [Method/Process] Drawing on the long-term practical experience of the National Newspaper Index, this study examines how to introduce LLMs into the classification and indexing process from three aspects: reducing the reading burden on indexers, using LLMs directly for classification, and combining LLMs with automatic indexing models. A prompt-assisted topic classification model is designed that uses the LLM to analyze and extract document content and guides it to output concise summaries. This lets indexers quickly understand the basics of the research, grasp the key concepts and their interrelationships, and thus determine quickly and accurately how to classify the collections. [Results/Conclusions] When the LLM cannot be used directly for text classification based on the "Chinese Library Classification" (CLC), it is combined with existing automatic models to produce the ACBKSY model. The overall classification accuracy of the model improved by 2.16%, and the non-rejection accuracy increased by 3.77%.
On this basis, the actual indexing workflow is optimized to make indexing work more systematic and coherent, so that every step from document input to final classification is more efficient and accurate. The optimized workflow has been put into use for the R and F categories of the collection, where it can improve indexing efficiency by 1.1 to 1.4 times. This paper still has shortcomings: the LLM was not given sufficient learning to fully understand the category settings of the CLC and some simple rules for dividing categories. In addition, CLC-based classification is essentially hierarchical classification, and how to guide the LLM to output classification results step by step through multi-round dialogue needs further study.
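The abstract reports two figures, overall classification accuracy and non-rejection accuracy, without defining them here. The following is a minimal sketch of one common reading, which is an assumption and not necessarily the authors' definition: a classifier may reject (abstain on) a document, overall accuracy counts correct predictions over all documents, and non-rejection accuracy counts them only over the documents the model did not reject. The function names and the class labels in the example are hypothetical.

```python
from typing import Optional, Sequence


def overall_accuracy(predicted: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Correct predictions divided by ALL documents.

    Assumption: a rejected document (prediction is None) counts as not correct.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def non_rejection_accuracy(predicted: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Correct predictions divided by the documents the model did NOT reject."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    answered = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    if not answered:
        return 0.0
    return sum(p == g for p, g in answered) / len(answered)


# Illustrative data only: four documents with made-up class labels.
gold = ["R5", "F2", "R4", "F8"]
predicted = ["R5", None, "R4", "F7"]  # None marks a rejected document

print(overall_accuracy(predicted, gold))        # 2 correct out of 4 documents
print(non_rejection_accuracy(predicted, gold))  # 2 correct out of 3 answered
```

Under this reading, the two metrics diverge whenever the model rejects documents, which is why a combined model can improve them by different amounts.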

Key words: automatic indexing, large language model (LLM), ERNIE Bot, GPT-4

CLC Number: 

  • G250.7