Journal of Library and Information Science in Agriculture

A Multi-dimensional Feature Text Complexity Framework and Knowledge Base Augmentation Model

CHANG Hao¹, XU Taotao², LI Feng¹

1. School of Computer and Information Engineering, Anhui University of Finance and Economics, Bengbu 233030
2. School of Management Science and Engineering, Anhui University of Finance and Economics, Bengbu 233000
Received: 2025-07-05    Online: 2025-09-22

Abstract:

[Purpose/Significance] In cross-domain natural language processing (NLP) tasks, deep learning models often show performance variation on texts with distinct domain characteristics, which degrades their generalization ability. Text complexity is one of the most explanatory factors behind these generalization gaps. [Method/Process] This paper makes two contributions. First, it constructs a multi-dimensional text complexity calculation framework grounded in systemic functional linguistics, which quantifies complexity hierarchically. At the lexical level, the framework dynamically identifies four types of non-standard expressions (abbreviations, emoticons, internet buzzwords, and alphanumeric mixed words) and computes a normativity score with a non-linear formula. At the sentence level, a novel inverse fusion enhancement method (IFEM) integrates punctuation anomaly density (weight 0.1), colloquial word ratio (weight 0.4), semantic ambiguity (weight 0.2), and sentence length (weight 0.3), generating a structural score by modeling feature synergy and suppression effects with an adaptive weighting mechanism. At the corpus level, a weighted fusion outputs the global corpus complexity. The framework successfully quantifies intrinsic differences between domain texts: the measured complexity of the waimai_10k dataset reaches 0.703, significantly higher than the 0.552 of ChnSentiCorp_htl_all, and it accurately tracks complexity changes after internal text deletion and substitution operations. Second, the paper designs a knowledge base-enhanced, dynamically adaptive CNN-BiLSTM model with two innovative mechanisms: 1) the knowledge base adopts a dual mapping architecture (text-to-label and vector-to-label) that supports loading historical experience and recording errors in real time; 2) feature weights are adjusted according to knowledge base content, for example by strengthening positive semantic representations or weakening negative expressions. The architecture integrates multi-scale CNN convolutional kernels for local feature extraction, a bidirectional long short-term memory network for capturing long-distance dependencies, and an attention mechanism for focusing on key information. Experiments on four Chinese datasets validate the proposed methods. [Results/Conclusions] The complexity framework is robust: complexity fluctuates by less than 3.3% after a 20% sample reduction and increases by at most 13.8% after short-text injection, while clearly separating datasets (0.703 for waimai_10k versus 0.552 for ChnSentiCorp_htl_all). The proposed model achieves the best performance on both the most standardized dataset (ChnSentiCorp_htl_all) and the most challenging one (waimai_10k), with accuracies of 0.943 4 and 0.923 8, respectively, significantly outperforming the Transformer and large language models such as DeepSeek-v3 and Qwen-Plus.
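The abstract fixes the four IFEM feature weights and the three-level pipeline but not the concrete formulas. Below is a minimal Python sketch of that pipeline; the lexicons, the square-root non-linearity, the synergy term, and the 0.4/0.6 corpus fusion weights are all illustrative assumptions (the fusion weights are inferred from Table 2 below, not stated in the text).

```python
import re

# Hypothetical lexicons: the paper's actual dictionaries are not given here.
ABBREVIATIONS = {"yyds", "xswl"}
EMOTICONS = {"^_^", "T_T", "233"}
BUZZWORDS = {"给力", "内卷"}
ALNUM_MIX = re.compile(r"[A-Za-z]+\d+|\d+[A-Za-z]+")  # alphanumeric mixed words

def lexical_normativity(tokens):
    """Lexical score alpha_w: share of standard tokens, passed through a
    non-linear squash (the paper's exact formula is unstated; sqrt is a
    placeholder)."""
    if not tokens:
        return 1.0
    nonstandard = sum(
        1 for t in tokens
        if t in ABBREVIATIONS or t in EMOTICONS or t in BUZZWORDS
        or ALNUM_MIX.fullmatch(t)
    )
    return (1.0 - nonstandard / len(tokens)) ** 0.5

# Feature weights quoted in the abstract for the IFEM sentence score.
IFEM_WEIGHTS = {"punct": 0.1, "colloquial": 0.4, "ambiguity": 0.2, "length": 0.3}

def ifem_structural_score(features):
    """Sentence score beta_w: weighted fusion of the four features plus an
    assumed multiplicative synergy term standing in for the paper's
    synergy/suppression modeling."""
    base = sum(IFEM_WEIGHTS[k] * features[k] for k in IFEM_WEIGHTS)
    synergy = features["colloquial"] * features["ambiguity"]
    return min(1.0, base + 0.1 * synergy)

def corpus_complexity(alpha_w, beta_w, w_a=0.4, w_b=0.6):
    """Corpus-level weighted fusion lambda(c); 0.4/0.6 reproduce Table 2."""
    return w_a * alpha_w + w_b * beta_w

if __name__ == "__main__":
    alpha = lexical_normativity(["这家", "外卖", "yyds", "233", "速度", "给力"])
    beta = ifem_structural_score(
        {"punct": 0.2, "colloquial": 0.7, "ambiguity": 0.4, "length": 0.5})
    print(f"alpha_w={alpha:.3f} beta_w={beta:.3f} "
          f"lambda(c)={corpus_complexity(alpha, beta):.3f}")
```

On the sample sentence, this sketch yields λ(c) ≈ 0.62, in the same range as the waimai figures reported in Table 2.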

Key words: cross-domain sentiment analysis, text complexity quantification, knowledge base enhancement, deep learning, dynamic self-adaptation

CLC Number: TP391

Table 1  Summary of datasets

Dataset    Samples   Median length   Mean length   Std. dev.   25th percentile   75th percentile
Chn        10 480    75              109.28        128.77      47                154
waimai     11 987    17              25.05         24.68       11                30
Chn_all    7 766     84              128.52        143.63      45                154
weibo      15 000    55              66.36         45.19       29                98

Fig.1  Statistics of each dataset

Table 2  Results of text complexity

Dataset    Normativity α_w   Structural score β_w   Complexity λ(c)
Chn        0.821             0.406                  0.572
waimai     0.732             0.683                  0.703
Chn_all    0.778             0.402                  0.552
weibo      0.863             0.450                  0.615
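The corpus-level fusion weights are not spelled out in this excerpt, but all four rows of Table 2 are consistent with a fixed weighted sum, which readers can verify directly:

$$\lambda(c) = 0.4\,\alpha_w + 0.6\,\beta_w$$

For Chn: 0.4 × 0.821 + 0.6 × 0.406 = 0.572; for weibo: 0.4 × 0.863 + 0.6 × 0.450 ≈ 0.615. The same weights reproduce 0.703 (waimai) and 0.552 (Chn_all) after rounding.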

Table 3  Complexity of each dataset after internal text deletion

Dataset    Original complexity   New sample count   New complexity
Chn        0.572                 8 480              0.557
waimai     0.703                 9 987              0.680
Chn_all    0.552                 5 766              0.538
weibo      0.615                 13 000             0.607

Table 4  Text complexity after internal text replacement

Dataset    Original complexity   New sample count   New complexity
Chn        0.572                 12 480             0.627
waimai     0.703                 13 987             0.768
Chn_all    0.552                 9 766              0.628
weibo      0.615                 17 000             0.645
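A quick check of the robustness figures quoted in the abstract against Tables 3 and 4 (each operation changes the sample count by 2 000): after deletion, the relative complexity drop is (0.572 − 0.557)/0.572 ≈ 2.6% for Chn, ≈ 3.3% for waimai (the maximum), ≈ 2.5% for Chn_all, and ≈ 1.3% for weibo; after replacement, complexity rises by ≈ 9.6%, ≈ 9.2%, ≈ 13.8% (the maximum, Chn_all), and ≈ 4.9%, respectively.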

Fig.2  Core architecture of the proposed model

Fig.3  Architecture layers of CNN-BiLSTM

Fig.4  Application mechanism of the knowledge base
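Fig.4 describes the knowledge base at the level of its interfaces: a dual text-to-label and vector-to-label mapping, historical experience loading, real-time error recording, and knowledge-driven feature reweighting. A minimal sketch of such a structure follows; the class name, the cosine threshold, and the gain/damp factors are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

class KnowledgeBase:
    """Sketch of the dual-mapping knowledge base outlined in Fig.4."""

    def __init__(self, sim_threshold=0.9):
        self.text2label = {}      # exact-text mapping
        self.vec2label = []       # (embedding, label) pairs
        self.sim_threshold = sim_threshold

    def load_history(self, records):
        """Load historical experience as (text, embedding, label) triples."""
        for text, vec, label in records:
            self.text2label[text] = label
            self.vec2label.append((np.asarray(vec, float), label))

    def record_error(self, text, vec, true_label):
        """Real-time error recording for a misclassified sample."""
        self.text2label[text] = true_label
        self.vec2label.append((np.asarray(vec, float), true_label))

    def lookup(self, text, vec):
        """Try an exact text hit, then the nearest stored vector by cosine;
        None means: defer to the neural model."""
        if text in self.text2label:
            return self.text2label[text]
        vec = np.asarray(vec, float)
        best_label, best_sim = None, self.sim_threshold
        for v, label in self.vec2label:
            sim = float(vec @ v) / (np.linalg.norm(vec) * np.linalg.norm(v) + 1e-12)
            if sim > best_sim:
                best_label, best_sim = label, sim
        return best_label

    @staticmethod
    def adjust_features(features, label, gain=1.2, damp=0.8):
        """Knowledge-driven reweighting: strengthen features for positive
        hits, weaken them for negative ones (factors are assumptions)."""
        return np.asarray(features, float) * (gain if label == "positive" else damp)
```

Here adjust_features mirrors the abstract's mechanism of strengthening positive semantic representations and weakening negative expressions, while a None return from lookup signals a fallback to the neural model.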

Fig.5  Bidirectional long short-term memory layer
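Fig.3 and Fig.5 outline the CNN-BiLSTM backbone with attention. The following minimal PyTorch sketch matches that description, with multi-scale convolutions feeding a BiLSTM whose outputs are pooled by additive attention; every hyper-parameter (embedding size, kernel sizes 2/3/4, hidden width, class count) is an illustrative assumption rather than the paper's setting.

```python
import torch
import torch.nn as nn

class CNNBiLSTMAttention(nn.Module):
    """Minimal sketch: multi-scale convolutions for local features, a BiLSTM
    for long-range dependencies, additive attention over time steps."""

    def __init__(self, vocab_size, embed_dim=128, n_filters=64,
                 kernel_sizes=(2, 3, 4), hidden=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k, padding=k // 2)
            for k in kernel_sizes)
        self.bilstm = nn.LSTM(n_filters * len(kernel_sizes), hidden,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, seq) token ids
        e = self.embed(x).transpose(1, 2)       # (batch, embed_dim, seq)
        # Multi-scale conv; trim each output back to the input length so
        # the scales can be concatenated channel-wise.
        c = torch.cat([torch.relu(conv(e))[..., :x.size(1)]
                       for conv in self.convs], dim=1).transpose(1, 2)
        h, _ = self.bilstm(c)                   # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)   # attention
        ctx = (h * w.unsqueeze(-1)).sum(dim=1)  # weighted context vector
        return self.fc(ctx)

model = CNNBiLSTMAttention(vocab_size=30000)
print(model(torch.randint(1, 30000, (4, 50))).shape)  # torch.Size([4, 2])
```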

Table 5  Comparison of various models on the Chn dataset

Model            Accuracy   Precision   Recall    F1-score
T5-transformer   0.506 7    0.508 4     0.502 8   0.505 6
Transformer      0.755 0    0.780 0     0.756 0   0.781 1
OH-CNN-BiLSTM    0.814 2    0.826 8     0.814 3   0.820 5
RNN-LSTM         0.881 7    0.868 7     0.881 4   0.875 0
GPT-3.5-Turbo    0.884 2    0.922 5     0.884 7   0.903 2
DeepSeek-v3      0.887 5    0.923 1     0.888 0   0.905 2
Qwen-Plus        0.889 2    0.932 6     0.889 8   0.910 7
CNN-BiLSTM       0.891 7    0.912 1     0.892 0   0.901 9
This Work        0.977 5    0.967 8     0.988 5   0.978 0

Table 6  Performance of various models on the Chn_all dataset

Model            Accuracy   Precision   Recall    F1-score
T5-transformer   0.685 3    0.685 3     0.685 3   0.578 2
Transformer      0.778 6    0.920 7     0.740 8   0.821 0
GPT-3.5-Turbo    0.779 9    0.994 5     0.837 2   0.909 1
Qwen-Plus        0.783 1    0.993 2     0.839 0   0.909 6
DeepSeek-v3      0.794 7    0.993 4     0.847 5   0.914 6
RNN-LSTM         0.830 8    0.890 1     0.814 0   0.850 4
OH-CNN-BiLSTM    0.837 2    0.885 9     0.814 9   0.848 9
CNN-BiLSTM       0.848 1    0.871 1     0.809 6   0.839 2
This Work        0.943 4    0.968 4     0.940 4   0.954 2

Table 7  Performance of various models on the weibo dataset

Model            Accuracy   Precision   Recall    F1-score
T5-transformer   0.507 5    0.538 8     0.586 8   0.467 1
OH-CNN-BiLSTM    0.813 6    0.698 2     0.804 3   0.747 7
CNN-BiLSTM       0.823 6    0.768 1     0.768 5   0.777 2
RNN-LSTM         0.831 1    0.741 7     0.812 7   0.775 6
Qwen-Plus        0.840 3    0.800 4     0.842 3   0.820 8
GPT-3.5-Turbo    0.847 0    0.805 1     0.849 1   0.826 5
Transformer      0.849 9    0.748 3     0.828 7   0.786 5
DeepSeek-v3      0.862 3    0.819 1     0.864 5   0.841 2
This Work        0.955 3    0.911 7     0.956 1   0.933 4

Table 8  Performance of each model on the waimai dataset

Model            Accuracy   Precision   Recall    F1-score
T5-transformer   0.521 3    0.538 8     0.521 3   0.467 1
OH-CNN-BiLSTM    0.630 7    0.660 5     0.624 1   0.641 8
RNN-LSTM         0.643 7    0.621 1     0.644 1   0.632 4
CNN-BiLSTM       0.663 3    0.659 7     0.661 3   0.660 5
Transformer      0.849 9    0.748 3     0.828 7   0.786 5
GPT-3.5-Turbo    0.903 7    0.900 1     0.877 8   0.888 8
Qwen-Plus        0.904 5    0.896 0     0.880 3   0.888 1
DeepSeek-v3      0.912 4    0.903 0     0.890 9   0.896 9
This Work        0.923 8    0.932 1     0.923 1   0.927 6