Journal of Library and Information Science in Agriculture


Research on a Multi-Task Knowledge Extraction Method for Traditional Chinese Medicine Ancient Books Integrating Chain-of-Thought

AN Bo1,2

  1. Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081
  2. University of Chinese Academy of Social Sciences, Beijing 102488
  • Received: 2025-06-09  Online: 2025-10-09
  • About the author:

    AN Bo (1986- ), male, PhD, associate research fellow; research interests: natural language processing and knowledge graphs

  • Funding:
    National Social Science Fund of China general project, "Research on Constructing a Tibetan-Chinese Bilingual Knowledge Graph of Tibetan Ancient Books" (22BTQ010)

A Multi-Task Knowledge Extraction Method for Traditional Chinese Medicine Ancient Books Integrating Chain-of-Thought

AN Bo1,2   

  1. Institute of Ethnology and Anthropology, Chinese Academy of Social Sciences, Beijing 100081
    2. University of Chinese Academy of Social Sciences, Beijing 102488
  • Received: 2025-06-09  Online: 2025-10-09

Abstract:

[Purpose/Significance] This study focuses on the modern use of traditional Chinese medicine (TCM) ancient books, where automated knowledge extraction is hampered by complex page layouts, the coexistence of traditional, simplified, and variant characters, inconsistent use of term aliases, and strong cross-paragraph semantic dependencies. It proposes an integrated technical pipeline that improves the efficiency of batch digitization as well as extraction accuracy and interpretability, in support of digital humanities and information science research. [Method/Process] An "ancient book → knowledge graph" workflow is constructed: a multimodal large language model recognizes the text of the ancient books and converts traditional characters to simplified ones; CoTCMKE, a three-task joint model that fuses chain-of-thought reasoning with ontology constraints, is proposed to perform entity recognition, relation extraction, and entity alignment simultaneously, and is validated on the Shang Han Za Bing Lun (Treatise on Cold Damage and Miscellaneous Diseases). [Results/Conclusions] Experimental results show that, compared with conventional instruction fine-tuning, CoTCMKE improves F1 by 3.1, 1.6, and 1.3 percentage points on entity recognition, relation extraction, and entity alignment, respectively. In cross-book transfer to the Jin Kui Yao Lue, the model maintains stable performance without retraining, demonstrating good robustness and scalability. The study indicates that explicitly fusing chain-of-thought reasoning with an ontology is an effective route to building knowledge graphs of TCM ancient books, enabling continual incremental updating and cross-book extension with only a small amount of annotation.

Key words: knowledge graph of traditional Chinese medicine ancient books, multimodal large language model, chain-of-thought reasoning
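
As a rough illustration of the joint extraction step described in [Method/Process] above, the sketch below shows what a single chain-of-thought prompt covering entity recognition, ontology-guided relation extraction, and alias alignment might look like. The prompt wording, the JSON output schema, and the `call_llm` helper are illustrative assumptions, not the released CoTCMKE implementation.

```python
import json

# Hypothetical chain-of-thought prompt for joint NER / RE / EA over one passage.
# The task wording and output schema are assumptions for illustration only.
JOINT_COT_PROMPT = """你是中医药古籍知识抽取助手。请逐步推理后输出JSON。
步骤1(实体识别): 找出条文中的疾病、症状、方剂、药物等实体并给出类型。
步骤2(关系抽取): 依据本体允许的关系(如 方剂-主治-症状, 方剂-组成-药物)生成三元组。
步骤3(实体对齐): 将别名/异体写法归并到规范名(如 桂枝汤 与 桂枝湯)。
条文: {passage}
输出: {{"entities": [...], "relations": [...], "alignments": [...]}}
"""


def extract_from_passage(passage: str, call_llm) -> dict:
    """Run the joint CoT prompt and parse the model's JSON answer.

    `call_llm` is a placeholder for whatever chat-completion client is used
    (e.g., a locally served Qwen2.5 model); it is not defined by the paper.
    """
    reply = call_llm(JOINT_COT_PROMPT.format(passage=passage))
    # Keep only the JSON part in case the model also returns its reasoning text.
    start, end = reply.find("{"), reply.rfind("}") + 1
    return json.loads(reply[start:end])


if __name__ == "__main__":
    demo = "太阳中风……啬啬恶寒,淅淅恶风,翕翕发热,鼻鸣干呕者,桂枝汤主之。"
    fake_llm = lambda prompt: json.dumps({
        "entities": [{"text": "桂枝汤", "type": "方剂"}, {"text": "恶寒", "type": "症状"}],
        "relations": [{"head": "桂枝汤", "rel": "主治", "tail": "恶寒"}],
        "alignments": [{"mention": "桂枝湯", "canonical": "桂枝汤"}],
    }, ensure_ascii=False)
    print(extract_from_passage(demo, fake_llm))
```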

Abstract:

[Purpose/Significance] Although traditional Chinese medicine (TCM) classics contain valuable knowledge, they remain difficult to process automatically due to their complex page layouts, the coexistence of traditional, simplified, and variant character forms, alias-rich terminology, and strong cross-paragraph semantic dependencies. Existing pipelines often split optical character recognition (OCR), normalization, entity recognition, relation extraction, and entity alignment into separate stages, which leads to error propagation. Additionally, many studies focus on modern clinical texts rather than historical sources. This paper addresses these gaps by presenting an end-to-end pipeline that transforms ancient page images into a structured knowledge graph. The central contribution is CoTCMKE, a chain-of-thought (CoT), ontology-constrained joint model that performs named entity recognition (NER), relation extraction (RE), and entity alignment (EA) simultaneously. By making intermediate reasoning explicit and binding predictions to a TCM ontology, the framework improves batch digitization efficiency, extraction accuracy, and interpretability for digital humanities and library and information science (LIS) applications. [Method/Process] We built a unified pipeline with three steps. 1) Text recognition: a multimodal large language model (MLLM) recognizes text directly from complex pages with mixed vertical/horizontal layouts and performs context-aware traditional-to-simplified conversion. 2) Ontology construction: following the principles of semantic completeness, multimodal friendliness, evolvability, and interoperability, experts curate an ontology of core TCM concepts (e.g., diseases, symptoms, formulae, herbs) with aliases and constraints to guide decoding and ensure consistency. 3) Knowledge extraction: CoTCMKE integrates CoT with ontology constraints for multi-task extraction, namely entity localization and normalization, ontology-consistent relation generation, and cross-passage/cross-volume entity alignment. Constraint-aware decoding applies immediate checks and backtracks when a generated entity or relation violates ontology rules or alias mappings. For data, we used the Shang Han Lun. Qwen2.5-VL-32B assists with OCR, conversion, and initial auto-labeling; two TCM-trained annotators independently review and reconcile the results. The final sets contain 2 340 NER items, 1 880 RE items, and 450 EA pairs, evaluated with 10-fold cross-validation. The MLLM was adapted via LoRA with early stopping. The comparison methods include traditional deep models, a unified IE framework, prompt-only inference, and a LoRA-SFT baseline. [Results/Conclusions] On the Shang Han Lun, CoTCMKE outperformed LoRA-SFT by 3.1 F1 percentage points for NER, 1.6 for RE, and 1.3 for EA. In cross-book transfer to the Jin Kui Yao Lue, the model maintained stable performance without retraining, indicating robustness and scalability. Ablation results showed that CoT reduced boundary and ambiguity errors, while ontology constraints curbed illegal triples and alias fragmentation; combining both yielded the best overall results. The analysis yielded the following observations: 1) explicit medical relation templates act as semantic guardrails; 2) proactive alias consolidation before decoding reduces entity scattering and improves alignment; 3) explicit type-path guidance helps disambiguate fine-grained categories (e.g., pulse findings vs. general symptoms). The framework supports the automatic construction of "formula-symptom-herb" triples, alias and variant normalization, and evidence-linked semantic search and navigation, which benefit LIS workflows, education, and research. Current limitations include the scope of the curated ontology and the focus on two classics. Future work will extend the approach to additional TCM classics and broader historical corpora, support continual incremental learning, and deliver knowledge services based on the constructed graphs.
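
The abstract states that decoding is constraint-aware: generated entities and relations are checked immediately against the ontology, with backtracking on violations, and aliases are consolidated before alignment. The following is a minimal sketch of such a validate-and-retry loop; the data structures (`RELATION_SIGNATURES`, `ALIAS_MAP`) and the `regenerate` callback are assumed placeholders rather than the ontology or code described in the paper.

```python
# Minimal sketch of ontology-constrained decoding with backtracking.
# RELATION_SIGNATURES and ALIAS_MAP stand in for the expert-curated TCM ontology
# described in the abstract; their contents here are illustrative.
RELATION_SIGNATURES = {
    "主治": ("方剂", "症状"),   # formula -- treats --> symptom
    "组成": ("方剂", "药物"),   # formula -- contains --> herb
}
ALIAS_MAP = {"桂枝湯": "桂枝汤"}  # variant / alias -> canonical name


def normalize(name: str) -> str:
    """Consolidate aliases and variant forms before validation."""
    return ALIAS_MAP.get(name, name)


def violates_ontology(triple: dict, entity_types: dict) -> bool:
    """A triple is illegal if its relation is unknown or its argument types mismatch."""
    sig = RELATION_SIGNATURES.get(triple["rel"])
    if sig is None:
        return True
    head_t = entity_types.get(normalize(triple["head"]))
    tail_t = entity_types.get(normalize(triple["tail"]))
    return (head_t, tail_t) != sig


def constrained_decode(triples, entity_types, regenerate, max_retries=2):
    """Keep legal triples; ask the model to regenerate illegal ones, then give up."""
    accepted = []
    for triple in triples:
        for _ in range(max_retries + 1):
            if not violates_ontology(triple, entity_types):
                accepted.append({**triple,
                                 "head": normalize(triple["head"]),
                                 "tail": normalize(triple["tail"])})
                break
            # Backtrack: `regenerate` is a placeholder for re-prompting the model
            # with the violated constraint made explicit.
            triple = regenerate(triple)
    return accepted


if __name__ == "__main__":
    types = {"桂枝汤": "方剂", "恶寒": "症状"}
    raw = [{"head": "桂枝湯", "rel": "主治", "tail": "恶寒"},
           {"head": "恶寒", "rel": "组成", "tail": "桂枝汤"}]  # type-mismatched triple
    print(constrained_decode(raw, types, regenerate=lambda t: t))
```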

Key words: traditional Chinese medicine knowledge graph, multi-modal large language model, chain-of-thought
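
The abstract also notes that the MLLM was adapted with LoRA under early stopping. Below is a minimal sketch of that adaptation step using the Hugging Face peft library; the base checkpoint, rank, and target modules are assumptions for illustration, and the multimodal model is simplified to a text-only causal LM.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Assumed base checkpoint; the paper works with a Qwen2.5-VL-class multimodal
# model, which is simplified here for brevity.
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Low-rank adapters on the attention projections; only these small matrices
# are trained, which keeps adaptation lightweight on a small labeled corpus.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Supervised fine-tuning on the CoT-formatted extraction data, with early
# stopping monitored on a held-out fold, would follow here.
```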

CLC Number: T39

Cite this article

AN Bo. Research on a Multi-Task Knowledge Extraction Method for Traditional Chinese Medicine Ancient Books Integrating Chain-of-Thought[J/OL]. Journal of Library and Information Science in Agriculture. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0422.

AN Bo. A Multi-Task Knowledge Extraction Method for Traditional Chinese Medicine Ancient Books Integrating Chain-of-Thought[J/OL]. Journal of Library and Information Science in Agriculture. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0422.