Journal of Library and Information Science in Agriculture ›› 2024, Vol. 36 ›› Issue (5): 32-42. doi: 10.13998/j.cnki.issn1002-1248.24-0346

• Research Article •

Exploring the Application of Large Language Models in Classification Indexing

姜鹏, 任龑, 朱蓓琳   

  1. Shanghai Library, Shanghai 200030
  • Received: 2024-04-13 Published: 2024-09-24
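  • About the authors: JIANG Peng (1986- ), male, master's degree, engineer; research interests: classification indexing and digital humanities. REN Yan (1990- ), male, master's degree, assistant librarian; research interests: classification indexing and digital humanities. ZHU Beiling (1992- ), female, master's degree, librarian; research interests: classification indexing and digital humanities.
  • Funding:
    Shanghai Library "2151 Project" grant "Applicability Evaluation of AIGC Services for Assisting Document Indexing"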

Exploration and Practice of Classification Indexing Combined with Large Language Models

JIANG Peng, REN Yan, ZHU Beiling   

  1. Shanghai Library, Shanghai 200030
  • Received:2024-04-13 Published:2024-09-24

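Abstract (translated from the Chinese): [Purpose/Significance] Classification indexing of documents is one of the basic tasks of libraries and other information institutions, and limited staff cannot keep up with the very large volume of documents to be classified. Large language models (LLMs), with their strong natural language understanding and processing capabilities, are used for natural language tasks such as text generation, automatic summarization, and text classification. They can be integrated into the whole document indexing process and help relieve the classification indexing workload. [Method/Process] Drawing on long-term working practice at the National Newspaper Index (《全国报刊索引》), this study explores how to introduce LLMs into the classification indexing workflow to improve indexing efficiency, from three angles: reducing the reading burden on indexers, using LLMs directly for classification, and combining LLMs with automatic indexing models. [Results/Conclusions] Through a series of comparative tests and analyses, a Prompt-assisted topic classification model and the ACBKSY automatic indexing model were designed. The Prompt-assisted topic classification model helps indexers quickly grasp the key points of a document and reduces their reading burden. The ACBKSY model improved overall classification accuracy by 2.16% and non-rejection accuracy by 3.77%. On this basis, the actual indexing workflow was optimized. The optimized workflow is now in use for indexing documents in the R and F main classes and improves indexing efficiency by a factor of 1.1 to 1.4.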

Keywords (translated from the Chinese): classification indexing, large language model, ERNIE Bot (文心一言), GPT-4

Abstract: [Purpose/Significance] Document classification is one of the fundamental tasks of libraries and other information service institutions. Limited human resources make it difficult to classify the vast number of documents, and current automatic indexing technologies are not yet fully integrated into the indexing process. Large language models (LLMs), with their strong natural language understanding and processing capabilities, have been used for natural language processing tasks such as text generation, automatic summarization, and text classification, and can be integrated into the entire classification process. [Method/Process] Drawing on the long-term practical experience of the National Newspaper Index, this study examines how to introduce LLMs into the classification and indexing process from three angles: reducing the reading burden on indexers, using LLMs directly for classification, and combining LLMs with automatic indexing models. A prompt-assisted topic classification model is designed that uses the LLM to analyze and extract document content and guides it to output concise summaries. This allows indexers to quickly understand the basics of the research, grasp the key concepts and their interrelationships, and thus determine quickly and accurately how to classify each document. [Results/Conclusions] Where the LLM cannot be used directly for text classification based on the "Chinese Library Classification" (CLC), it is combined with existing automatic indexing models to form the ACBKSY model. The overall classification accuracy of the model improved by 2.16%, and the non-rejection accuracy improved by 3.77%.
On this basis, the actual indexing workflow is optimized to make indexing work more systematic and coherent, so that every step from document input to final classification is more efficient and accurate. The optimized workflow is in use for the R and F categories of the collection and improves indexing efficiency by a factor of 1.1 to 1.4. The study has limitations. The LLM was not given sufficient opportunity to learn the category structure of the CLC and some of its simple classification rules. In addition, classification based on the CLC is essentially hierarchical, and how to guide the LLM to output classification results step by step through multi-round dialogue needs further study.
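The abstract reports two figures, overall classification accuracy and non-rejection accuracy, without defining them here. The sketch below shows one plausible reading, stated as an assumption rather than the authors' definition: the model may decline to classify a document (a "rejection", represented as `None`), overall accuracy counts correct answers over all documents, and non-rejection accuracy counts correct answers over only the documents the model did classify. The function name `accuracy_metrics` and the sample class labels are hypothetical and do not come from the paper.

```python
from typing import Dict, Optional, Sequence


def accuracy_metrics(
    gold: Sequence[str], predicted: Sequence[Optional[str]]
) -> Dict[str, float]:
    """Compute overall and non-rejection accuracy.

    Assumption (not from the paper): a prediction of None means the
    model rejected the document, i.e. declined to assign a class.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    total = len(gold)
    # Keep only the documents the model actually classified.
    answered = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(1 for g, p in answered if g == p)

    return {
        # Correct answers over all documents; rejections count as misses.
        "overall_accuracy": correct / total if total else 0.0,
        # Correct answers over the documents that were not rejected.
        "non_rejection_accuracy": correct / len(answered) if answered else 0.0,
        # Share of documents the model declined to classify.
        "rejection_rate": (total - len(answered)) / total if total else 0.0,
    }


# Hypothetical CLC-style labels, for illustration only.
gold_labels = ["R5", "F2", "R1", "F8", "R7"]
model_output = ["R5", None, "R1", "F7", "R7"]
metrics = accuracy_metrics(gold_labels, model_output)
print(metrics)
```

Under this reading, a model that rejects more uncertain documents can raise its non-rejection accuracy while its overall accuracy stays flat or falls, which is why reporting both figures, as the abstract does, is informative.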

Key words: automatic indexing, large language model (LLM), ERNIE Bot, GPT-4

CLC Number: G250.7

Cite this article

姜鹏, 任龑, 朱蓓琳. 大语言模型在分类标引工作中的应用探索[J]. 农业图书情报学报, 2024, 36(5): 32-42.

JIANG Peng, REN Yan, ZHU Beiling. Exploration and Practice of Classification Indexing Combined with Large Language Models[J]. Journal of Library and Information Science in Agriculture, 2024, 36(5): 32-42.