Journal of Library and Information Science in Agriculture

A Multi-dimensional Feature Text Complexity Framework and Knowledge Base Augmentation Model

CHANG Hao¹, XU Taotao², LI Feng¹

1. School of Computer and Information Engineering, Anhui University of Finance and Economics, Bengbu 233030
2. School of Management Science and Engineering, Anhui University of Finance and Economics, Bengbu 233000
Received: 2025-07-05    Online: 2025-09-22

Abstract:

[Purpose/Significance] In cross-domain natural language processing (NLP) tasks, deep learning models often show performance variation on texts with distinct domain characteristics, which degrades their generalization ability. Text complexity is one of the most explanatory factors behind these generalization gaps. [Method/Process] This paper makes two contributions. First, it constructs a multi-dimensional text complexity calculation framework grounded in systemic functional linguistics, which quantifies complexity hierarchically. At the lexical level, the framework dynamically identifies four types of non-standard expressions (abbreviations, emoticons, internet buzzwords, and alphanumeric mixed words) and computes a normativity score with a non-linear formula. At the sentence level, a novel inverse fusion enhancement method (IFEM) integrates punctuation anomaly density (weight 0.1), colloquial word ratio (weight 0.4), semantic ambiguity (weight 0.2), and sentence length (weight 0.3), generating a structural score by modeling feature synergy and suppression effects with an adaptive weighting mechanism. At the corpus level, a weighted fusion outputs the global corpus complexity. The framework successfully quantifies intrinsic differences between domain texts: the measured complexity of the waimai_10k dataset reaches 0.703, significantly higher than the 0.552 of ChnSentiCorp_htl_all, and it accurately tracks complexity changes after internal text deletion and substitution operations. Second, the paper designs a knowledge base-enhanced, dynamically adaptive CNN-BiLSTM model with two innovative mechanisms: 1) the knowledge base adopts a dual mapping architecture (text-to-label and vector-to-label) that supports loading historical experience and recording errors in real time; 2) feature weights are adjusted according to knowledge base content, for example by strengthening positive semantic representations or weakening negative expressions. The architecture integrates multi-scale CNN convolutional kernels for local feature extraction, a bidirectional long short-term memory network for capturing long-distance dependencies, and an attention mechanism for focusing on key information. Experiments on four Chinese datasets validate the proposed methods. [Results/Conclusions] The complexity framework is robust: complexity fluctuates by less than 3.3% after a 20% sample reduction and increases by at most 13.8% after short-text injection, while clearly separating datasets (0.703 for waimai_10k versus 0.552 for ChnSentiCorp_htl_all). The proposed model achieves the best performance on both the most standardized dataset (ChnSentiCorp_htl_all) and the most challenging one (waimai_10k), with accuracies of 0.943 4 and 0.923 8, respectively, significantly outperforming the Transformer and large language models such as DeepSeek-v3 and Qwen-Plus.
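The abstract fixes the four IFEM feature weights and the three-level pipeline but not the concrete formulas. Below is a minimal Python sketch of that pipeline; the lexicons, the square-root non-linearity, the synergy term, and the 0.4/0.6 corpus fusion weights are all illustrative assumptions (the fusion weights are inferred from Table 2 below, not stated in the text).

```python
import re

# Hypothetical lexicons: the paper's actual dictionaries are not given here.
ABBREVIATIONS = {"yyds", "xswl"}
EMOTICONS = {"^_^", "T_T", "233"}
BUZZWORDS = {"给力", "内卷"}
ALNUM_MIX = re.compile(r"[A-Za-z]+\d+|\d+[A-Za-z]+")  # alphanumeric mixed words

def lexical_normativity(tokens):
    """Lexical score alpha_w: share of standard tokens, passed through a
    non-linear squash (the paper's exact formula is unstated; sqrt is a
    placeholder)."""
    if not tokens:
        return 1.0
    nonstandard = sum(
        1 for t in tokens
        if t in ABBREVIATIONS or t in EMOTICONS or t in BUZZWORDS
        or ALNUM_MIX.fullmatch(t)
    )
    return (1.0 - nonstandard / len(tokens)) ** 0.5

# Feature weights quoted in the abstract for the IFEM sentence score.
IFEM_WEIGHTS = {"punct": 0.1, "colloquial": 0.4, "ambiguity": 0.2, "length": 0.3}

def ifem_structural_score(features):
    """Sentence score beta_w: weighted fusion of the four features plus an
    assumed multiplicative synergy term standing in for the paper's
    synergy/suppression modeling."""
    base = sum(IFEM_WEIGHTS[k] * features[k] for k in IFEM_WEIGHTS)
    synergy = features["colloquial"] * features["ambiguity"]
    return min(1.0, base + 0.1 * synergy)

def corpus_complexity(alpha_w, beta_w, w_a=0.4, w_b=0.6):
    """Corpus-level weighted fusion lambda(c); 0.4/0.6 reproduce Table 2."""
    return w_a * alpha_w + w_b * beta_w

if __name__ == "__main__":
    alpha = lexical_normativity(["这家", "外卖", "yyds", "233", "速度", "给力"])
    beta = ifem_structural_score(
        {"punct": 0.2, "colloquial": 0.7, "ambiguity": 0.4, "length": 0.5})
    print(f"alpha_w={alpha:.3f} beta_w={beta:.3f} "
          f"lambda(c)={corpus_complexity(alpha, beta):.3f}")
```

On the sample sentence, this sketch yields λ(c) ≈ 0.62, in the same range as the waimai figures reported in Table 2.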

Key words: cross-domain sentiment analysis, text complexity quantification, knowledge base enhancement, deep learning, dynamic self-adaptation

CLC Number: TP391

Table 1  Summary of datasets

Dataset    Samples   Median length   Mean length   Std. dev.   25th percentile   75th percentile
Chn        10 480    75              109.28        128.77      47                154
waimai     11 987    17              25.05         24.68       11                30
Chn_all    7 766     84              128.52        143.63      45                154
weibo      15 000    55              66.36         45.19       29                98

Fig.1  Statistics of each dataset

Table 2  Results of text complexity

Dataset    Normativity α_w   Structural score β_w   Complexity λ(c)
Chn        0.821             0.406                  0.572
waimai     0.732             0.683                  0.703
Chn_all    0.778             0.402                  0.552
weibo      0.863             0.450                  0.615
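The corpus-level fusion weights are not spelled out in this excerpt, but all four rows of Table 2 are consistent with a fixed weighted sum, which readers can verify directly:

$$\lambda(c) = 0.4\,\alpha_w + 0.6\,\beta_w$$

For Chn: 0.4 × 0.821 + 0.6 × 0.406 = 0.572; for weibo: 0.4 × 0.863 + 0.6 × 0.450 ≈ 0.615. The same weights reproduce 0.703 (waimai) and 0.552 (Chn_all) after rounding.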

Table 3  Complexity of each dataset after internal text deletion

Dataset    Original complexity   New sample count   New complexity
Chn        0.572                 8 480              0.557
waimai     0.703                 9 987              0.680
Chn_all    0.552                 5 766              0.538
weibo      0.615                 13 000             0.607

Table 4  Text complexity after internal text replacement

Dataset    Original complexity   New sample count   New complexity
Chn        0.572                 12 480             0.627
waimai     0.703                 13 987             0.768
Chn_all    0.552                 9 766              0.628
weibo      0.615                 17 000             0.645
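A quick check of the robustness figures quoted in the abstract against Tables 3 and 4 (each operation changes the sample count by 2 000): after deletion, the relative complexity drop is (0.572 − 0.557)/0.572 ≈ 2.6% for Chn, ≈ 3.3% for waimai (the maximum), ≈ 2.5% for Chn_all, and ≈ 1.3% for weibo; after replacement, complexity rises by ≈ 9.6%, ≈ 9.2%, ≈ 13.8% (the maximum, Chn_all), and ≈ 4.9%, respectively.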

Fig.2  Core architecture of the proposed model

Fig.3  Architecture layers of CNN-BiLSTM

Fig.4  Application mechanism of the knowledge base
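Fig.4 describes the knowledge base at the level of its interfaces: a dual text-to-label and vector-to-label mapping, historical experience loading, real-time error recording, and knowledge-driven feature reweighting. A minimal sketch of such a structure follows; the class name, the cosine threshold, and the gain/damp factors are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

class KnowledgeBase:
    """Sketch of the dual-mapping knowledge base outlined in Fig.4."""

    def __init__(self, sim_threshold=0.9):
        self.text2label = {}      # exact-text mapping
        self.vec2label = []       # (embedding, label) pairs
        self.sim_threshold = sim_threshold

    def load_history(self, records):
        """Load historical experience as (text, embedding, label) triples."""
        for text, vec, label in records:
            self.text2label[text] = label
            self.vec2label.append((np.asarray(vec, float), label))

    def record_error(self, text, vec, true_label):
        """Real-time error recording for a misclassified sample."""
        self.text2label[text] = true_label
        self.vec2label.append((np.asarray(vec, float), true_label))

    def lookup(self, text, vec):
        """Try an exact text hit, then the nearest stored vector by cosine;
        None means: defer to the neural model."""
        if text in self.text2label:
            return self.text2label[text]
        vec = np.asarray(vec, float)
        best_label, best_sim = None, self.sim_threshold
        for v, label in self.vec2label:
            sim = float(vec @ v) / (np.linalg.norm(vec) * np.linalg.norm(v) + 1e-12)
            if sim > best_sim:
                best_label, best_sim = label, sim
        return best_label

    @staticmethod
    def adjust_features(features, label, gain=1.2, damp=0.8):
        """Knowledge-driven reweighting: strengthen features for positive
        hits, weaken them for negative ones (factors are assumptions)."""
        return np.asarray(features, float) * (gain if label == "positive" else damp)
```

Here adjust_features mirrors the abstract's mechanism of strengthening positive semantic representations and weakening negative expressions, while a None return from lookup signals a fallback to the neural model.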

Fig.5  Bidirectional long short-term memory layer
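Fig.3 and Fig.5 outline the CNN-BiLSTM backbone with attention. The following minimal PyTorch sketch matches that description, with multi-scale convolutions feeding a BiLSTM whose outputs are pooled by additive attention; every hyper-parameter (embedding size, kernel sizes 2/3/4, hidden width, class count) is an illustrative assumption rather than the paper's setting.

```python
import torch
import torch.nn as nn

class CNNBiLSTMAttention(nn.Module):
    """Minimal sketch: multi-scale convolutions for local features, a BiLSTM
    for long-range dependencies, additive attention over time steps."""

    def __init__(self, vocab_size, embed_dim=128, n_filters=64,
                 kernel_sizes=(2, 3, 4), hidden=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, k, padding=k // 2)
            for k in kernel_sizes)
        self.bilstm = nn.LSTM(n_filters * len(kernel_sizes), hidden,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                       # x: (batch, seq) token ids
        e = self.embed(x).transpose(1, 2)       # (batch, embed_dim, seq)
        # Multi-scale conv; trim each output back to the input length so
        # the scales can be concatenated channel-wise.
        c = torch.cat([torch.relu(conv(e))[..., :x.size(1)]
                       for conv in self.convs], dim=1).transpose(1, 2)
        h, _ = self.bilstm(c)                   # (batch, seq, 2*hidden)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)   # attention
        ctx = (h * w.unsqueeze(-1)).sum(dim=1)  # weighted context vector
        return self.fc(ctx)

model = CNNBiLSTMAttention(vocab_size=30000)
print(model(torch.randint(1, 30000, (4, 50))).shape)  # torch.Size([4, 2])
```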

Table 5  Comparison of various models on the Chn dataset

Model            Accuracy   Precision   Recall    F1-score
T5-transformer   0.506 7    0.508 4     0.502 8   0.505 6
Transformer      0.755 0    0.780 0     0.756 0   0.781 1
OH-CNN-BiLSTM    0.814 2    0.826 8     0.814 3   0.820 5
RNN-LSTM         0.881 7    0.868 7     0.881 4   0.875 0
GPT-3.5-Turbo    0.884 2    0.922 5     0.884 7   0.903 2
DeepSeek-v3      0.887 5    0.923 1     0.888 0   0.905 2
Qwen-Plus        0.889 2    0.932 6     0.889 8   0.910 7
CNN-BiLSTM       0.891 7    0.912 1     0.892 0   0.901 9
This Work        0.977 5    0.967 8     0.988 5   0.978 0

Table 6  Performance of various models on the Chn_all dataset

Model            Accuracy   Precision   Recall    F1-score
T5-transformer   0.685 3    0.685 3     0.685 3   0.578 2
Transformer      0.778 6    0.920 7     0.740 8   0.821 0
GPT-3.5-Turbo    0.779 9    0.994 5     0.837 2   0.909 1
Qwen-Plus        0.783 1    0.993 2     0.839 0   0.909 6
DeepSeek-v3      0.794 7    0.993 4     0.847 5   0.914 6
RNN-LSTM         0.830 8    0.890 1     0.814 0   0.850 4
OH-CNN-BiLSTM    0.837 2    0.885 9     0.814 9   0.848 9
CNN-BiLSTM       0.848 1    0.871 1     0.809 6   0.839 2
This Work        0.943 4    0.968 4     0.940 4   0.954 2

Table 7  Performance of various models on the weibo dataset

Model            Accuracy   Precision   Recall    F1-score
T5-transformer   0.507 5    0.538 8     0.586 8   0.467 1
OH-CNN-BiLSTM    0.813 6    0.698 2     0.804 3   0.747 7
CNN-BiLSTM       0.823 6    0.768 1     0.768 5   0.777 2
RNN-LSTM         0.831 1    0.741 7     0.812 7   0.775 6
Qwen-Plus        0.840 3    0.800 4     0.842 3   0.820 8
GPT-3.5-Turbo    0.847 0    0.805 1     0.849 1   0.826 5
Transformer      0.849 9    0.748 3     0.828 7   0.786 5
DeepSeek-v3      0.862 3    0.819 1     0.864 5   0.841 2
This Work        0.955 3    0.911 7     0.956 1   0.933 4

Table 8  Performance of each model on the waimai dataset

Model            Accuracy   Precision   Recall    F1-score
T5-transformer   0.521 3    0.538 8     0.521 3   0.467 1
OH-CNN-BiLSTM    0.630 7    0.660 5     0.624 1   0.641 8
RNN-LSTM         0.643 7    0.621 1     0.644 1   0.632 4
CNN-BiLSTM       0.663 3    0.659 7     0.661 3   0.660 5
Transformer      0.849 9    0.748 3     0.828 7   0.786 5
GPT-3.5-Turbo    0.903 7    0.900 1     0.877 8   0.888 8
Qwen-Plus        0.904 5    0.896 0     0.880 3   0.888 1
DeepSeek-v3      0.912 4    0.903 0     0.890 9   0.896 9
This Work        0.923 8    0.932 1     0.923 1   0.927 6