Journal of Library and Information Science in Agriculture ›› 2024, Vol. 36 ›› Issue (9): 89-101. doi: 10.13998/j.cnki.issn1002-1248.24-0615

• Application and Practice •

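Digital Humanities & Large Language Models: Practice and Research in Semantic Retrieval of Ancient Documents

Haoxian WANG, Ziming ZHOU, Feifei DING, Chengfu WEI

  1. Peking University Library, Beijing 100871
  • Received: 2024-06-21 Online: 2024-09-05 Published: 2025-01-13
  • About the authors:

    Haoxian WANG (1991- ), master's degree, librarian. Research interests: information retrieval, applications of computer technology, and software development.

    Ziming ZHOU (1998- ), master's degree, assistant librarian. Research interests: computational linguistics and digital humanities.

    Feifei DING (1996- ), master's degree, librarian. Research interests: retrieval and recognition.

    Chengfu WEI (1982- ), associate research librarian, deputy director of the Computing Service Center. Research interests: digital libraries and software development.

  • Funding:
    Peking University Library Innovation Case "User-Oriented Empowerment of Innovation: Building a Computing Service Ecosystem" (6309900289/049)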

Abstract:

[Purpose/Significance] As artificial intelligence technology, and large language models in particular, continues to gain momentum, this paper explores their application in digital humanities, with a focus on the retrieval of ancient documents. Drawing on the practice and exploration of the ancient document retrieval platform at Peking University Library, the study brings new perspectives and methods to digital humanities, supports academic research and the transmission of cultural heritage, and offers a practical reference for other university libraries.

[Method/Process] The article begins with an overview of the origins and development of digital humanities, emphasizing its central role in humanities research. It then examines the current state of the art in large language models and analyzes their potential and advantages for identifying and classifying ancient documents, semantic understanding and parsing, and information extraction and association, together with the innovative applications and trends they bring to ancient document retrieval. Through the analysis of practical case studies, the paper constructs a basic semantic retrieval model whose core architecture consists of two components. The first is the construction of the retrieval engine: the ancient document information is pre-processed to generate basic metadata; large models then process and enrich these metadata to create auxiliary search fields and enriched text; and both the model-processed text and the original text are converted into semantic vectors and stored in a vector engine for rapid retrieval. The second is search and sorting: large models process the user's search terms to generate extended content, which is combined with the search terms to produce semantic vectors; the vector engine built in the first step then retrieves relevant documents, and the results are sorted algorithmically so that users quickly obtain the most relevant information. The case study also demonstrates the use of large language models for metadata extension, fast retrieval, and personalized recommendation. Taking the collection data of the ancient document system of Peking University Library as the research object, the paper processes over 250,000 records, primarily ancient books and rubbings, and over 10 million metadata items. A semantic retrieval platform was built with the Gradio framework on a server equipped with two NVIDIA RTX 4090 graphics cards (24 GB each) to manage and retrieve these data.

[Results/Conclusions] The main strengths and contributions of the study are its standardized metadata organization, metadata extension supported by large models, support for natural-language search terms, a fault-tolerant search mechanism, and efficient retrieval through the vector engine. Shortcomings remain, notably the accuracy of results generated by large models and the lack of a comprehensive analysis of user search data. Future work will address these issues to improve the effectiveness of the research.

Key words: large language models, digital humanities, ancient document retrieval, semantic understanding, cultural heritage
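
The two-part architecture outlined in the abstract (index construction, then search and sorting) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the large model, the embedding model, or the vector engine, so `expand_with_llm`, `embed`, `GLOSSARY`, and `VectorIndex` are hypothetical stand-ins, the records are made up, and scoring a record by the better of its two vectors is an assumption made here for illustration.

```python
import math
import zlib
from dataclasses import dataclass, field

DIM = 256  # toy vector size; a real embedding model fixes its own dimension

# Hypothetical glossary standing in for large-model output (assumption).
GLOSSARY = {
    "rubbing": "ink impression of a stone inscription",
    "gazetteer": "local history and geography",
    "anthology": "collected poems",
}


def expand_with_llm(text: str) -> str:
    """Placeholder for a large-model call that enriches metadata or a query."""
    lowered = text.lower()
    return " ".join(gloss for term, gloss in GLOSSARY.items() if term in lowered)


def embed(text: str) -> list:
    """Toy embedding: hashed character bigrams, L2-normalised (not a real model)."""
    vec = [0.0] * DIM
    padded = f" {text.lower()} "
    for i in range(len(padded) - 1):
        vec[zlib.crc32(padded[i:i + 2].encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a, b) -> float:
    """Dot product of two unit vectors, i.e. their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class VectorIndex:
    """Brute-force stand-in for a vector engine."""
    entries: list = field(default_factory=list)  # (record_id, [vectors])

    def add(self, record_id: str, metadata: str) -> None:
        # Part 1: enrich the metadata, then store vectors for both texts.
        enriched = f"{metadata} {expand_with_llm(metadata)}".strip()
        self.entries.append((record_id, [embed(metadata), embed(enriched)]))

    def search(self, query: str, top_k: int = 3) -> list:
        # Part 2: expand the query, embed it, score every record, sort.
        q = embed(f"{query} {expand_with_llm(query)}".strip())
        scored = [(max(cosine(q, v) for v in vecs), rid) for rid, vecs in self.entries]
        scored.sort(reverse=True)
        return [(rid, score) for score, rid in scored[:top_k]]


# Made-up records for illustration only; not data from the paper.
index = VectorIndex()
index.add("r1", "Rubbing of a Tang dynasty stele inscription")
index.add("r2", "Woodblock edition of a Ming dynasty gazetteer")
index.add("r3", "Manuscript copy of a Qing dynasty poetry anthology")

for record_id, score in index.search("stele rubbing", top_k=2):
    print(record_id, round(score, 3))
```

In a deployment at the scale the abstract reports, the brute-force scan would presumably be replaced by an approximate nearest-neighbour index, and the glossary lookup by an actual model call; the sketch only shows how the two phases share the same expansion and embedding steps.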

CLC number: G252.7

Cite this article

Haoxian WANG, Ziming ZHOU, Feifei DING, Chengfu WEI. Digital Humanities & Large Language Models: Practice and Research in Semantic Retrieval of Ancient Documents[J]. Journal of Library and Information Science in Agriculture, 2024, 36(9): 89-101.