
Journal of Library and Information Science in Agriculture, 2024, Vol. 36, Issue (9): 89-101. doi: 10.13998/j.cnki.issn1002-1248.24-0615


Digital Humanities & Large Language Models: Practice and Research in Semantic Retrieval of Ancient Documents

Haoxian WANG, Ziming ZHOU, Feifei DING, Chengfu WEI   

  1. Peking University Library, Beijing 100871
  • Received: 2024-06-21  Online: 2024-09-05  Published: 2025-01-13

Abstract:

[Purpose/Significance] As artificial intelligence technology, and large language models (LLMs) in particular, becomes widely adopted, this paper explores applications of LLMs in digital humanities, with a focus on the retrieval of ancient documents. Drawing on the development of the ancient document retrieval platform at Peking University Library, the study introduces new perspectives and methods to digital humanities, supports academic research and the transmission of cultural heritage, and offers a practical reference for other university libraries.

[Method/Process] The paper first reviews the origins and development of digital humanities and its central role in humanities research. It then surveys the state of the art in large language models and analyzes their potential for identifying and classifying ancient documents, for semantic understanding and parsing, and for information extraction and association. Based on practical case studies, the paper constructs a basic semantic retrieval model with two core components. The first is the construction of the retrieval engine: ancient document information is pre-processed to generate basic metadata; large models then process and enrich these metadata to create auxiliary search fields and enriched text; and both the model-processed text and the original text are converted into semantic vectors and stored in a vector engine for fast retrieval. The second is search and ranking: large models process the user's search terms to generate expanded content, which is combined with the original search terms to create semantic vectors; the vector engine then retrieves relevant documents, and a ranking algorithm orders the results so that the most relevant records appear first. Taking the ancient document collection data of Peking University Library as the research object, the paper processes over 250,000 records, primarily ancient books and rubbings, comprising over 10 million metadata items. A semantic retrieval platform for managing and searching these data was built with the Gradio framework on a server equipped with two NVIDIA RTX 4090 24 GB graphics cards.

[Results/Conclusions] The main strengths and contributions of the study are its standardized metadata organization, LLM-supported metadata expansion, support for natural-language search terms, fault-tolerant search, and efficient retrieval through the vector engine. Remaining shortcomings include the limited accuracy of LLM-generated results and the lack of a comprehensive analysis of user search data. Future work will address these issues.

Key words: large language models, digital humanities, ancient document retrieval, semantic understanding, cultural heritage

CLC Number: G252.7

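Table 1. Core metadata fields of ancient documents

| Category | Element identifier | Element name | DC name |
|---|---|---|---|
| Ancient books (古籍) | title | Title (题名) | title |
| Ancient books | creator | Principal responsible party (主要责任者) | creator |
| Ancient books | contributor | Other responsible parties (其他责任者) | contributor |
| Ancient books | date | Date (日期) | date |
| Ancient books | publisher | Publisher (出版者) | publisher |
| Ancient books | description | Notes (附注) | description |
| Ancient books | relation | Related resources (相关资源) | relation |
| Ancient books | subject | Subject (主题) | subject |
| Ancient books | coverage | Spatial and temporal coverage (时空范围) | coverage |
| Ancient books | language | Language (语种) | language |
| Ancient books | source | Source (来源) | source |
| Ancient books | rights | Rights (权限) | rights |
| Ancient books | type | Type (类型) | type |
| Ancient books | format | Format (格式) | format |
| Ancient books | identifier | Identifier (标识符) | identifier |
| Ancient books | edition | Edition type (版本类型) | edition |
| Ancient books | physical description | Physical description (载体形态) | physical_description |
| Ancient books | provenance | Collection history (收藏历史) | provenance |
| Ancient books | ancient book preservation | Document preservation (文献保护) | ancient_book_preservation |
| Ancient books | location | Holdings information (馆藏信息) | location |
| Ancient books | copy | Other copy information (其他复本信息) | copy |
| Rubbings (拓片) | title | Title (source labels: 题名, 题, 題名) | title |
| Rubbings | creator | Principal responsible party (主要责任者) | creator |
| Rubbings | contributor | Other responsible parties (其他责任者) | contributor |
| Rubbings | date | Date (日期) | date |
| Rubbings | description | Notes (附注) | description |
| Rubbings | relation | Related resources (相关资源) | relation |
| Rubbings | subject | Subject (主题) | subject |
| Rubbings | coverage | Spatial and temporal coverage (时空范围) | coverage |
| Rubbings | language | Language (语种) | language |
| Rubbings | source | Source (来源) | source |
| Rubbings | rights | Rights (权限) | rights |
| Rubbings | type | Type (类型) | type |
| Rubbings | format | Format (格式) | format |
| Rubbings | identifier | Identifier (标识符) | identifier |
| Rubbings | edition | Edition type (版本类型) | edition |
| Rubbings | physical description | Physical description (载体形态) | physical_description |
| Rubbings | provenance | Collection history (收藏历史) | provenance |
| Rubbings | ancient book preservation | Document preservation (文献保护) | ancient_book_preservation |
| Rubbings | location | Holdings information (馆藏信息) | location |
| Rubbings | copy | Other copy information (其他复本信息) | copy |
| Rubbings | place of object | Location of the inscribed object (金石所在地) | place_of_object |
| Rubbings | materials of object | Material of the inscribed object (金石材质) | materials_of_object |
| Rubbings | handwriting | Calligraphy and engraving features (书刻特征) | handwriting |
| Rubbings | full text | Transcribed text (录文) | full_text |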
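Table 1 implies a normalization step in which the labels found in source records (for example the three title variants 题名, 题, and 題名) are mapped to a single field name. The page does not describe how this is done, so the following is only a minimal sketch under stated assumptions: `LABEL_TO_FIELD` covers a subset of the rows in Table 1, and `normalize_record` and the `unmapped` bucket are hypothetical names introduced for illustration.

```python
import unicodedata

# Hypothetical label-to-field mapping, taken from a subset of the rows in Table 1.
LABEL_TO_FIELD = {
    "题名": "title",
    "题": "title",
    "題名": "title",
    "主要责任者": "creator",
    "其他责任者": "contributor",
    "日期": "date",
    "出版者": "publisher",
    "附注": "description",
    "主题": "subject",
    "语种": "language",
    "版本类型": "edition",
    "载体形态": "physical_description",
    "馆藏信息": "location",
    "金石所在地": "place_of_object",
    "录文": "full_text",
}


def normalize_record(raw: dict[str, str]) -> dict[str, list[str]]:
    """Map raw catalog labels to the field names listed in Table 1.

    Labels that are not in the mapping are kept under 'unmapped'
    (an assumption of this sketch) so that no source data is dropped.
    """
    record: dict[str, list[str]] = {}
    for label, value in raw.items():
        # NFKC folds full-width and compatibility forms; it does not convert
        # traditional characters to simplified ones, hence the explicit variants.
        key = unicodedata.normalize("NFKC", label).strip()
        field = LABEL_TO_FIELD.get(key)
        if field:
            record.setdefault(field, []).append(value.strip())
        else:
            record.setdefault("unmapped", []).append(f"{key}: {value.strip()}")
    return record


# Made-up sample input, for illustration only.
print(normalize_record({"題名": " sample title ", "unknown label": "x"}))
```

Keeping every value in a list allows repeated fields, such as multiple notes, to survive normalization.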

Fig. 1 Schematic diagram of the basic semantic retrieval model
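The two components described in the abstract (index construction, then search and ranking) can be illustrated with a small sketch. This is not the authors' implementation: `embed` is a toy stand-in (hashed character bigrams) for a real embedding model, `VectorIndex` is a hypothetical in-memory substitute for the vector engine, and embedding the original and enriched text as one concatenated string is an assumption made for brevity. The record identifiers and the second record are made up; the other strings are taken from Table 2.

```python
import hashlib
import math

DIM = 1024  # toy dimensionality chosen for this sketch


def embed(text: str) -> list[float]:
    """Toy embedding: hash character unigrams and bigrams into a unit vector."""
    vec = [0.0] * DIM
    chars = [c for c in text if not c.isspace()]
    grams = chars + [a + b for a, b in zip(chars, chars[1:])]
    for gram in grams:
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; inputs are already unit length, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))


class VectorIndex:
    """Hypothetical in-memory stand-in for the vector engine."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, original_text: str, enriched_text: str = "") -> None:
        # Index construction: embed original metadata together with LLM-enriched text.
        self._items.append((doc_id, embed(original_text + " " + enriched_text)))

    def search(self, query: str, expansions=(), top_k: int = 3):
        # Search and ranking: embed the query plus its expansions, rank by similarity.
        qvec = embed(" ".join([query, *expansions]))
        scored = [(cosine(qvec, vec), doc_id) for doc_id, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(doc_id, round(score, 3)) for score, doc_id in scored[:top_k]]


index = VectorIndex()
index.add("rec-1", "道光條約:8卷", "道光条约 清代外交史 条约缔结")
index.add("rec-2", "高陽縣志", "高阳县志 高阳县历史")  # made-up second record
print(index.search("近代条约", ["南京条约", "不平等条约"]))
```

In a real deployment the brute-force scan in `search` would be replaced by the vector engine's nearest-neighbor lookup, and `embed` by a model that captures meaning rather than character overlap.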

Table 2. Examples of large language model processing of metadata and search terms

Each example lists the prompt, the JSON returned by the large model, and the main points of note.

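Prompt:

Please expand the following metadata. Requirements:

1. Provide both traditional and simplified Chinese forms
2. Extract named entities and expand on them, for example with introductions to people, times, and places
3. Automatically extract keywords
4. Recommend related books
5. Output in JSON format

Metadata (catalog record, in traditional Chinese):

正題名及説明:道光條約:8卷
主要責任者:汪毅[等]編纂
出版:民國初期[1912-1927]外交部印刷所
版本類別:鉛印本
外觀形態:綫裝,25.7cm
一般附注:書尾有版權葉。
一般附注:4冊
主題詞:條約,外交,國際關係,清代
古籍語種:漢文
典藏號:X/387.087/3107a:1
館藏信息:北京大學圖書館
服務範圍:館內閱覽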

LLM output (JSON):

{
  "正题名及说明": { "繁体": "道光條約: 8卷", "简体": "道光条约: 8卷", "内容简介": "该书记录了清代道光年间中国与外国缔结的重要条约,涵盖了外交关系、国际交往及相关法律条文,是研究清代外交史的重要文献。" },
  "主要责任者": { "繁体": "汪毅[等]編纂", "简体": "汪毅[等]编纂" },
  "出版信息": {
    "出版时间": { "繁体": "民國初期[1912-1927]", "简体": "民国初期[1912-1927]" },
    "出版机构": { "繁体": "外交部印刷所", "简体": "外交部印刷所" }
  },
  "版本類別": { "繁体": "鉛印本", "简体": "铅印本" },
  "外觀形態": { "繁体": "綫裝,25.7cm", "简体": "线装,25.7cm" },
  "一般附注": [
    { "繁体": "書尾有版權葉。", "简体": "书尾有版权页。" },
    { "繁体": "共4冊。", "简体": "共4册。" }
  ],
  "主題詞": [ "條約", "外交", "國際關係", "清代" ],
  "古籍語種": { "繁体": "漢文", "简体": "汉文" },
  "道光皇帝与年号说明": {
    "道光皇帝": { "姓名": "爱新觉罗·旻宁", "统治时间": "1820年-1850年", "简介": "道光皇帝是清朝的第八位皇帝,他在位期间,清朝国力逐渐衰落,面临着内忧外患的局面。" },
    "道光年号": { "起始年份": "1821年", "结束年份": "1850年", "简介": "道光年号共使用了30年,这一时期清朝经历了鸦片战争等重大事件,标志着清朝开始进入半殖民地半封建社会。" }
  },
  "自动抽取关键词": [ "道光条约", "清代外交史", "条约缔结", "外交关系", "国际交往", "法律条文", "道光皇帝", "爱新觉罗·旻宁", "清朝第八位皇帝", "国力衰落", "内忧外患", "道光年号", "鸦片战争", "半殖民地半封建社会" ],
  "推荐图书": [
    { "书名": "鸦片战争", "作者": "蓝诗玲", "简介": "该书站在一个更加广阔的视域来考察中英鸦片战争,旨在让读者跨出地域的限制,认真反思这场世界冲突的种种罪恶和矛盾。" },
    { "书名": "中国近代史", "作者": "陈恭禄", "简介": "这本书对中国近代史进行了深入的剖析,涵盖了从鸦片战争到辛亥革命的重要历史事件,对《南京条约》及其影响也有所涉及。" }
  ]
}

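Main points of note:

1. Conversion between traditional and simplified Chinese
2. Generation of JSON-formatted data
3. Expansion of the content summary
4. Introductions to the Daoguang Emperor and the Daoguang era name
5. Automatic keyword extraction
6. Recommendation of related books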
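The abstract states that model-processed metadata become auxiliary search fields and enriched text, but the page does not say how the JSON reply is consumed. One plausible approach, offered only as a sketch, is to collect every string value in the reply into a single text that can be embedded alongside the original record. The helper names `flatten_strings` and `build_enriched_text` are hypothetical, and returning an empty string for an unparseable reply is an assumption of this sketch.

```python
import json


def flatten_strings(node):
    """Yield every string value found anywhere in a parsed JSON structure."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from flatten_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from flatten_strings(item)


def build_enriched_text(llm_reply: str) -> str:
    """Join the unique string values of a JSON reply, in first-seen order.

    Returns "" when the reply is not valid JSON, so that one bad
    generation cannot interrupt indexing.
    """
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return ""
    cleaned = (s.strip() for s in flatten_strings(data))
    unique = dict.fromkeys(s for s in cleaned if s)
    return " ".join(unique)


# Excerpt of the reply shown in Table 2.
reply = '{"版本類別": {"繁体": "鉛印本", "简体": "铅印本"}, "主題詞": ["條約", "外交"]}'
print(build_enriched_text(reply))  # 鉛印本 铅印本 條約 外交
```

Because both script variants end up in the enriched text, a query typed in either simplified or traditional characters has overlapping material to match against.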

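Prompt:

Please interpret the search term semantically and return related search terms in JSON format, with Chinese-English parallel text required.

Search term: 近代条约 (modern treaties)

LLM output (JSON):

{"近代条约相关检索词":["南京条约","虎门条约","望夏条约","马关条约","辛丑条约","割地","赔款","五口通商","协定关税","最惠国待遇","通商口岸","1842年条约","19世纪条约","鸦片战争后条约","甲午中日战争后条约","义和团运动后条约","半殖民地半封建社会","主权丧失","赔款数目庞大","社会经济发展影响","中英条约","中美条约","中日条约","十一国条约","不平等条约","和约","战败条约","南京(条约签订地点)","广东虎门(条约签订地点)","日本马关(条约签订地点)","北京(条约签订地点)"]}

Main points of note:

Expands the meaning of the search term across names, keywords, signing time and background, impact, countries involved, treaty types, and signing locations.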

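Prompt:

Please interpret the search term semantically and return related search terms in JSON format.

Search term: 我想检索一些古文献,主要是写河北高阳县的,包括历史演变,或者河北高阳县名人 (I would like to find some ancient documents, mainly about Gaoyang County, Hebei, including its historical evolution or notable people from Gaoyang County, Hebei)

LLM output (JSON):

{"检索词":[{"简体":"河北高阳县","繁体":"河北高陽縣"},{"简体":"高阳县志","繁体":"高陽縣志"},{"简体":"高阳县历史","繁体":"高陽縣歷史"},{"简体":"高阳县历史演变","繁体":"高陽縣歷史演變"},{"简体":"颛顼","繁体":"顓頊"},{"简体":"孙承宗(高阳)","繁体":"孫承宗(高陽)"},{"简体":"李鸿藻(高阳)","繁体":"李鴻藻(高陽)"},{"简体":"高阳名人","繁体":"高陽名人"},{"简体":"高阳名人传记","繁体":"高陽名人傳記"},{"简体":"高阳名人著作","繁体":"高陽名人著作"},{"简体":"高阳纺织业","繁体":"高陽紡織業"},{"简体":"高阳戏曲文化","繁体":"高陽戲曲文化"},{"简体":"高阳古迹","繁体":"高陽古迹"},{"简体":"高阳民俗","繁体":"高陽民俗"},{"简体":"高阳八才旧里","繁体":"高陽八才舊里"},{"简体":"高阳古代诗词","繁体":"高陽古代詩詞"},{"简体":"高阳古代碑刻","繁体":"高陽古代碑刻"},{"简体":"高阳古代建筑","繁体":"高陽古代建築"}]}

Main points of note:

1. Processing of a natural-language query
2. Suggestion of 颛顼 (Zhuanxu), a name associated with Gaoyang's historical development
3. Suggestions of related historical figures
4. Correspondence between traditional and simplified forms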
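The two search-term replies in Table 2 have different shapes: one is a flat list of strings and the other is a list of simplified/traditional pairs, each under a different top-level key. Before the expansions can be combined with the user's search term for embedding, they need to be reduced to one list of terms. The following is a sketch of that step under stated assumptions: `expansion_terms`, `build_query_text`, and the `limit` cap are hypothetical and are not described on the page.

```python
import json


def expansion_terms(llm_reply: str, limit: int = 20) -> list[str]:
    """Collect expansion terms from either reply shape shown in Table 2."""
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    terms: list[str] = []
    for value in data.values():  # the top-level key name varies between replies
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):  # e.g. {"简体": ..., "繁体": ...}
                terms.extend(str(v) for v in item.values())
            else:
                terms.append(str(item))
    cleaned = (t.strip() for t in terms)
    unique = list(dict.fromkeys(t for t in cleaned if t))
    return unique[:limit]  # cap the list so expansions do not swamp the query


def build_query_text(user_query: str, llm_reply: str) -> str:
    """Combine the user's search term with its expansions into one text to embed."""
    return " ".join([user_query, *expansion_terms(llm_reply)])


paired = '{"检索词":[{"简体":"高阳县志","繁体":"高陽縣志"},{"简体":"颛顼","繁体":"顓頊"}]}'
flat = '{"近代条约相关检索词":["南京条约","虎门条约","南京条约"]}'
print(expansion_terms(paired))  # ['高阳县志', '高陽縣志', '颛顼', '顓頊']
print(build_query_text("近代条约", flat))  # 近代条约 南京条约 虎门条约
```

Placing the user's own wording first and de-duplicating the expansions keeps the original intent visible in the combined text.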

Prompt:

Please help check the search term:

光徐皇帝

LLM output (JSON):

{"name":"光徐皇帝","related_real_figures":[{"name":"光武帝","dynasty":"东汉","reign_period":"公元25年-57年"},{"name":"光绪帝","dynasty":"清朝","reign_period":"1875年-1908年"}],"note":"如果'光徐皇帝'确实指代某个历史人物,请进一步核实相关资料。"}

Main points of note:

Error correction and expansion
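In this example the large model itself proposes corrections for the misspelled term. As a rough illustration of fault-tolerant matching without a model, the sketch below swaps in plain string similarity from Python's standard `difflib` module. It is a stand-in technique, not the paper's method: `KNOWN_NAMES` is a hypothetical authority list, `suggest_names` is a hypothetical helper, and the cutoff of 0.5 is chosen for this example.

```python
import difflib

# Hypothetical authority list; a real system would draw on a name authority file.
KNOWN_NAMES = ["光武帝", "光绪帝", "康熙帝", "乾隆帝"]


def suggest_names(term: str, candidates=KNOWN_NAMES, cutoff: float = 0.5) -> list[str]:
    """Return up to three candidates whose similarity to `term` meets the cutoff."""
    return difflib.get_close_matches(term, candidates, n=3, cutoff=cutoff)


print(suggest_names("光徐皇帝"))
```

Character-level similarity can surface 光武帝 and 光绪帝 as candidates, but unlike the model's reply it knows nothing about dynasties or reign periods, so it works best as a cheap first filter.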

Fig. 2 Schematic diagram of semantic space

Fig. 3 Embedding model schematic diagram

Fig. 4 Search interface
