
Journal of Library and Information Science in Agriculture ›› 2023, Vol. 35 ›› Issue (7): 52-62. doi: 10.13998/j.cnki.issn1002-1248.23-0355


Construction and Application of Semantic Retrieval Model for Ancient Agricultural Literature

LIU Nanzhu1,2, CUI Yunpeng1,2,*, WANG Mo1,2   

  1. Institute of Agricultural Information, Chinese Academy of Agricultural Sciences, Beijing 100081;
    2. Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081
  • Received: 2023-05-29 Online: 2023-07-05 Published: 2023-09-20

Abstract: [Purpose/Significance] Ancient Chinese agricultural books are the main carrier of traditional agricultural experience and record the productivity and essence of China's agricultural history. The agricultural knowledge they contain has not lost its value over time and still offers practical guidance for problems that arise in modern agriculture. However, these books are written in classical Chinese, which is obscure and unpunctuated, making them difficult to use. Semantic retrieval is a retrieval method that automatically queries and extracts relevant information from information sources at the semantic level. It can capture the intention behind a user's question and search accordingly, and can therefore return more accurate and more consistent results. However, most relevant research currently focuses on major languages, and research on sentence embedding for ancient Chinese prose is insufficient. To fill this gap and give scholars a more convenient way to retrieve and trace ancient agricultural knowledge, this study uses contrastive learning to construct a semantic retrieval model that takes a vernacular Chinese query as input and automatically returns the most relevant ancient agricultural paragraph. [Method/Process] SikuBERT, a model pre-trained on the Siku Quanshu corpus, is used as the base model. Using contrastive learning, the model is further trained on a self-built ancient agricultural dataset, yielding a semantic retrieval model that accepts vernacular Chinese queries and returns the ancient agricultural paragraphs semantically most similar to the query.
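The training and retrieval steps described above can be illustrated with a minimal sketch. The abstract does not state the exact loss, so this assumes an in-batch contrastive (InfoNCE-style) objective in which each vernacular query is paired with one ancient paragraph and the other paragraphs in the batch serve as negatives. The toy vectors stand in for SikuBERT sentence embeddings, and the function names and the temperature of 0.05 are illustrative assumptions, not the authors' settings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, passages, temperature=0.05):
    """In-batch contrastive loss (assumed form, not taken from the paper).

    passages[i] is the positive for queries[i]; every other passage in
    the batch acts as a negative. Lower loss means each query is closer
    to its own passage than to the others.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in passages]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(queries)

def retrieve(query_vec, passage_vecs, top_k=1):
    """Return indices of the top_k passages ranked by cosine similarity."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda j: cosine(query_vec, passage_vecs[j]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy embeddings: hypothetical stand-ins for encoder outputs.
queries = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]   # vernacular queries
passages = [[0.9, 0.2, 0.1], [0.1, 0.8, 0.3]]  # ancient paragraphs

aligned_loss = info_nce_loss(queries, passages)
shuffled_loss = info_nce_loss(queries, passages[::-1])
best_for_first_query = retrieve(queries[0], passages)
```

In this toy batch the correctly paired embeddings give a lower loss than the shuffled pairing, which is the signal that training would push the encoder toward. At query time only `retrieve` is needed, applied to precomputed paragraph embeddings.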
[Results/Conclusions] On the test set, the ancient agricultural text semantic retrieval model achieves a Spearman coefficient of 86.51%, an improvement over the baseline model's 83.69%. Recall on the self-built ancient agricultural literature retrieval test set also improves over the baseline, and the model retrieves ancient agricultural literature well. However, semantic retrieval models usually require semantic similarity or semantic matching datasets for training. Large-scale, pure ancient Chinese data is lacking in the field of ancient agricultural literature, and constructing such datasets is costly because it requires highly qualified specialists. This experiment therefore used a self-built dataset for training and is limited by the quantity and quality of the ancient agricultural corpus. The current semantic retrieval model for ancient agricultural literature is still not as effective as expected. In future work, we will look for training methods suited to small samples, such as transfer learning based on cross-lingual pre-trained models, to improve retrieval performance.
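For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Spearman coefficient between gold similarity labels and model similarity scores can be computed. It is a generic implementation of Spearman's rank correlation (the Pearson correlation of average ranks), not the authors' evaluation script. The scores are invented toy values, and the helper names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Toy data: gold similarity labels and two sets of model scores.
gold = [5, 3, 4, 1, 2]
same_order = [0.9, 0.5, 0.7, 0.1, 0.3]  # ranks pairs exactly like gold
one_swap = [0.9, 0.7, 0.5, 0.1, 0.3]    # swaps the 2nd and 3rd pairs

rho_same = spearman(gold, same_order)
rho_swap = spearman(gold, one_swap)
```

Because the metric depends only on rank order, a model can score well without its similarity values matching the gold labels numerically. Here the scores that preserve the gold ordering give a coefficient of 1, and a single adjacent swap lowers it to 0.9.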

Key words: ancient agricultural script, semantic retrieval, contrastive learning, model building, deep learning

CLC Number: TP391