Digital Humanities & Large Language Models: Practice and Research in Semantic Retrieval of Ancient Documents

Haoxian WANG, Ziming ZHOU, Feifei DING, Chengfu WEI   

  1. Peking University Library, Beijing 100871
  Received: 2024-06-21  Online: 2024-09-05  Published: 2025-01-13


[Purpose/Significance] As artificial intelligence technology, and large language models in particular, becomes increasingly widespread, this paper explores its application in the digital humanities, with a focus on the retrieval of ancient documents. Drawing on the development of the ancient document retrieval platform at Peking University Library, the study introduces new perspectives and methods to the digital humanities, supports academic research and the transmission of cultural heritage, and offers a practical reference for other university libraries.
[Method/Process] The article first reviews the origins and development of the digital humanities and its central role in humanities research. It then examines the current state of the art in large language models and analyzes their potential for identifying and classifying ancient documents, for semantic understanding and parsing, and for information extraction and association. Based on practical case studies, the paper constructs a basic semantic retrieval model whose core architecture has two components. The first is the construction of the retrieval engine: ancient document information is pre-processed to generate basic metadata; large models then process and enhance these metadata to create auxiliary search fields and enriched text; and both the model-processed text and the original text are converted into semantic vectors, which are stored in a vector engine for rapid retrieval. The second is search and sorting: large models process the user's search terms to generate extended content, which is combined with the search terms to create semantic vectors; the vector engine then retrieves relevant documents, and the results are sorted algorithmically so that users quickly obtain the most relevant information. Taking the ancient document collection data of Peking University Library as the research object, the paper processes over 250,000 records, primarily ancient books and rubbings, as well as over 10 million metadata items. A semantic retrieval platform for managing and retrieving these data was built with the Gradio framework on a server equipped with two NVIDIA RTX 4090 graphics cards (24 GB each).
[Results/Conclusions] The main strengths and contributions of the study are its standardized metadata organization, metadata extension supported by large models, support for natural-language search terms, fault-tolerant search mechanisms, and the efficient retrieval capabilities of the vector engine. Shortcomings remain, notably the limited accuracy of results generated by large models and the insufficiently comprehensive analysis of user search data. Future work will address these issues to improve the effectiveness of the research.

Key words: large language models, digital humanities, ancient document retrieval, semantic understanding, cultural heritage

Table 1

Core metadata fields of ancient documents

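| Category | Element identifier | Element name | DC name |
|---|---|---|---|
| Ancient books (古籍) | title | Title (题名) | title |
| | creator | Principal creator (主要责任者) | creator |
| | contributor | Other contributors (其他责任者) | contributor |
| | date | Date (日期) | date |
| | publisher | Publisher (出版者) | publisher |
| | description | Notes (附注) | description |
| | relation | Related resources (相关资源) | relation |
| | subject | Subject (主题) | subject |
| | coverage | Spatial and temporal coverage (时空范围) | coverage |
| | language | Language (语种) | language |
| | source | Source (来源) | source |
| | rights | Rights (权限) | rights |
| | type | Type (类型) | type |
| | format | Format (格式) | format |
| | identifier | Identifier (标识符) | identifier |
| | edition | Edition type (版本类型) | edition |
| | physical description | Physical description (载体形态) | physical_description |
| | provenance | Collection history (收藏历史) | provenance |
| | ancient book preservation | Document preservation (文献保护) | ancient_book_preservation |
| | location | Holdings information (馆藏信息) | location |
| | copy | Other copy information (其他复本信息) | copy |
| Rubbings (拓片) | title | Title (题名, 题, 題名) | title |
| | creator | Principal creator (主要责任者) | creator |
| | contributor | Other contributors (其他责任者) | contributor |
| | date | Date (日期) | date |
| | description | Notes (附注) | description |
| | relation | Related resources (相关资源) | relation |
| | subject | Subject (主题) | subject |
| | coverage | Spatial and temporal coverage (时空范围) | coverage |
| | language | Language (语种) | language |
| | source | Source (来源) | source |
| | rights | Rights (权限) | rights |
| | type | Type (类型) | type |
| | format | Format (格式) | format |
| | identifier | Identifier (标识符) | identifier |
| | edition | Edition type (版本类型) | edition |
| | physical description | Physical description (载体形态) | physical_description |
| | provenance | Collection history (收藏历史) | provenance |
| | ancient book preservation | Document preservation (文献保护) | ancient_book_preservation |
| | location | Holdings information (馆藏信息) | location |
| | copy | Other copy information (其他复本信息) | copy |
| | place of object | Location of the inscribed object (金石所在地) | place_of_object |
| | materials of object | Material of the inscribed object (金石材质) | materials_of_object |
| | handwriting | Calligraphy and engraving features (书刻特征) | handwriting |
| | full text | Transcription (录文) | full_text |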
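
Table 1 lists several label variants for the rubbing title (题名, 题, 題名) that all map to one element. A minimal sketch of how such variants might be normalized onto the DC-style names is given below. The alias table is only a small illustrative subset, and `FIELD_ALIASES`, `normalize_record`, the `_unmapped` key, and the sample record are hypothetical names and data, not taken from the platform.

```python
# Hypothetical alias table: source field labels -> DC-style element names.
# The three title variants come from Table 1; the other entries are a
# small illustrative subset of the same table.
FIELD_ALIASES = {
    "题名": "title",
    "题": "title",
    "題名": "title",
    "主要责任者": "creator",
    "其他责任者": "contributor",
    "日期": "date",
    "金石所在地": "place_of_object",
    "录文": "full_text",
}


def normalize_record(raw: dict) -> dict:
    """Map raw field labels onto element names, keeping repeated values."""
    normalized = {}
    unmapped = {}
    for label, value in raw.items():
        key = FIELD_ALIASES.get(label.strip())
        if key is None:
            # Keep unknown labels visible instead of silently dropping them.
            unmapped[label] = value
            continue
        normalized.setdefault(key, []).append(value)
    if unmapped:
        normalized["_unmapped"] = unmapped
    return normalized


# Made-up rubbing record, for illustration only.
record = normalize_record(
    {"題名": "示例碑", "题": "示例碑拓片", "金石所在地": "北京", "备注": "示例"}
)
```

Collecting values in lists means that a record carrying two title variants keeps both, which leaves the choice of a preferred form to a later step.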


Schematic diagram of the basic semantic retrieval model
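
The two components of the model described in the abstract (index construction, then search and sort) can be illustrated with a small sketch. It is not the platform's implementation: `embed` is a toy character-bigram stand-in for a real embedding model, the in-memory list stands in for the vector engine, and all function names and sample records are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: per-token character bigrams."""
    grams = Counter()
    for token in text.split():
        grams.update(token[i:i + 2] for i in range(max(len(token) - 1, 1)))
    return grams


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_index(records):
    """Phase 1: embed the original text together with the enriched text."""
    return [(r["id"], embed(r["original"] + " " + r["enriched"])) for r in records]


def search(index, term, expansions, top_k=3):
    """Phase 2: embed the term plus its expansions, then sort by similarity."""
    query = embed(" ".join([term, *expansions]))
    scored = [(doc_id, cosine(query, vec)) for doc_id, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical records; "enriched" plays the role of LLM-generated text.
records = [
    {"id": "A", "original": "道光條約 8卷", "enriched": "道光条约 清代外交史 条约缔结 鸦片战争"},
    {"id": "B", "original": "金石拓片", "enriched": "碑刻 书法 拓片"},
]
index = build_index(records)
results = search(index, "近代条约", ["南京条约", "不平等条约", "鸦片战争"])
```

In this toy setting, record A ranks first only because the enriched text and the expanded query share simplified-character terms; the original traditional-character title alone would not match, which is the motivation for embedding both forms.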

Table 2

Examples of large language model processing metadata and search terms

Prompt (提示语) | JSON after large-model processing | Key reference points

{
  "正题名及说明": {
    "繁体": "道光條約: 8卷",
    "简体": "道光条约: 8卷",
    "内容简介": "该书记录了清代道光年间中国与外国缔结的重要条约,涵盖了外交关系、国际交往及相关法律条文,是研究清代外交史的重要文献。"
  },
  "主要责任者": { "繁体": "汪毅[等]編纂", "简体": "汪毅[等]编纂" },
  "出版信息": {
    "出版时间": { "繁体": "民國初期[1912-1927]", "简体": "民国初期[1912-1927]" },
    "出版机构": { "繁体": "外交部印刷所", "简体": "外交部印刷所" }
  },
  "版本類別": { "繁体": "鉛印本", "简体": "铅印本" },
  "外觀形態": { "繁体": "綫裝,25.7cm", "简体": "线装,25.7cm" },
  "一般附注": [
    { "繁体": "書尾有版權葉。", "简体": "书尾有版权页。" },
    { "繁体": "共4冊。", "简体": "共4册。" }
  ],
  "主題詞": [ "條約", "外交", "國際關係", "清代" ],
  "古籍語種": { "繁体": "漢文", "简体": "汉文" },
  "道光皇帝与年号说明": {
    "道光皇帝": {
      "姓名": "爱新觉罗·旻宁",
      "统治时间": "1820年-1850年",
      "简介": "道光皇帝是清朝的第八位皇帝,他在位期间,清朝国力逐渐衰落,面临着内忧外患的局面。"
    },
    "道光年号": {
      "起始年份": "1821年",
      "结束年份": "1850年",
      "简介": "道光年号共使用了30年,这一时期清朝经历了鸦片战争等重大事件,标志着清朝开始进入半殖民地半封建社会。"
    }
  },
  "自动抽取关键词": [ "道光条约", "清代外交史", "条约缔结", "外交关系", "国际交往", "法律条文", "道光皇帝", "爱新觉罗·旻宁", "清朝第八位皇帝", "国力衰落", "内忧外患", "道光年号", "鸦片战争", "半殖民地半封建社会" ],
  "推荐图书": [
    { "书名": "鸦片战争", "作者": "蓝诗玲", "简介": "该书站在一个更加广阔的视域来考察中英鸦片战争,旨在让读者跨出地域的限制,认真反思这场世界冲突的种种罪恶和矛盾。" },
    { "书名": "中国近代史", "作者": "陈恭禄", "简介": "这本书对中国近代史进行了深入的剖析,涵盖了从鸦片战争到辛亥革命的重要历史事件,对《南京条约》及其影响也有所涉及。" }
  ]
}
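
The abstract states that the model-processed metadata is turned into enriched text before being vectorized. One plausible way to do this, shown here only as a sketch, is to flatten every string value of the enriched JSON into a single block of text. The helper names `flatten_values` and `to_search_text` are hypothetical, and the sample is an abbreviated excerpt of the record above.

```python
import json


def flatten_values(node):
    """Yield every string leaf of a nested JSON structure, in document order."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from flatten_values(value)
    elif isinstance(node, list):
        for item in node:
            yield from flatten_values(item)


def to_search_text(enriched_json: str) -> str:
    """Join the unique string leaves into one text block for embedding."""
    leaves = flatten_values(json.loads(enriched_json))
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return " ".join(dict.fromkeys(leaves))


# Abbreviated excerpt of the enriched record, for illustration only.
sample = (
    '{"正题名及说明": {"繁体": "道光條約: 8卷", "简体": "道光条约: 8卷"},'
    ' "出版信息": {"出版机构": {"繁体": "外交部印刷所", "简体": "外交部印刷所"}},'
    ' "主題詞": ["條約", "外交"]}'
)
search_text = to_search_text(sample)
```

Deduplication matters here because fields such as the publisher are identical in traditional and simplified script and would otherwise be counted twice.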

{"近代条约相关检索词":["南京条约","虎门条约","望夏条约","马关条约","辛丑条约","割地","赔款","五口通商","协定关税","最惠国待遇","通商口岸","1842年条约","19世纪条约","鸦片战争后条约","甲午中日战争后条约","义和团运动后条约","半殖民地半封建社会","主权丧失","赔款数目庞大","社会经济发展影响","中英条约","中美条约","中日条约","十一国条约","不平等条约","和约","战败条约","南京(条约签订地点)","广东虎门(条约签订地点)","日本马关(条约签订地点)","北京(条约签订地点)"]}
Key reference points: expands the meaning of the search term across name, keywords, signing time and background, impact, related countries, treaty type, and signing location.
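
An expansion reply of this shape has to be combined with the user's original search term before vectorization. The sketch below shows one possible way to do so, assuming the reply is a JSON object whose values are lists of strings. The function name `expand_query` and the `limit` parameter are hypothetical, and the reply used here is a shortened version of the example above with one duplicate added for illustration.

```python
import json


def expand_query(term: str, llm_reply: str, limit: int = 10) -> str:
    """Combine the user's term with up to `limit` LLM-suggested expansions."""
    payload = json.loads(llm_reply)
    expansions = []
    for values in payload.values():
        if isinstance(values, list):
            expansions.extend(v for v in values if isinstance(v, str))
    # Put the original term first and drop duplicates, preserving order.
    ordered = list(dict.fromkeys([term, *expansions]))
    return " ".join(ordered[: limit + 1])


# Shortened reply; the repeated "南京条约" is added to show deduplication.
reply = '{"近代条约相关检索词": ["南京条约", "虎门条约", "南京条约", "不平等条约"]}'
query_text = expand_query("近代条约", reply, limit=2)
```

Capping the number of expansions is a design choice of this sketch: it keeps a long list of loosely related terms from outweighing the term the user actually typed.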

{"name":"光徐皇帝","related_real_figures":[{"name":"光武帝","dynasty":"东汉","reign_period":"公元25年-57年"},{"name":"光绪帝","dynasty":"清朝","reign_period":"1875年-1908年"}],"note":"如果'光徐皇帝'确实指代某个历史人物,请进一步核实相关资料。"}
Key reference points: error correction and expansion.


Schematic diagram of semantic space


Embedding model schematic diagram


Search interface

