农业图书情报学报


多维特征文本复杂度框架与知识库增强模型

常郝1, 徐涛涛2, 李峰1   

  1. 安徽财经大学 计算机与信息工程学院,蚌埠 233000
    2. 安徽财经大学 管理科学与工程学院,蚌埠 233000
  • 收稿日期:2025-07-05 出版日期:2025-09-22
  • About the authors:

    CHANG Hao (1983- ), male, Ph.D., professor; research interests: artificial intelligence, 3D integrated circuit testing and fault tolerance

    XU Taotao (2000- ), male, master's student; research interests: natural language processing, artificial intelligence

    LI Feng (1983- ), male, Ph.D., associate professor; research interests: audio source separation, speech emotion recognition, and speech signal processing

  • Funding:
    National Natural Science Foundation of China, "Research on Key Technologies for Through-Silicon Via Testing of 3D Integrated Circuits Based on Delay Features" (61704001); Anhui Provincial Natural Science Foundation, "Research on Reliability and Yield Improvement Methods for 3D Chips Oriented to TSV Delay Testing" (1808085QF196); Natural Science Research Project of Anhui Universities, "Research on Speech Emotion Recognition Based on Multimodal Data Fusion" (2024AH050018)

A Multi-dimensional Feature Text Complexity Framework and Knowledge Base Augmentation Model

CHANG Hao1, XU Taotao2, LI Feng1   

  1. School of Computer and Information Engineering, Anhui University of Finance and Economics, Bengbu 233030
    2. School of Management Science and Engineering, Anhui University of Finance and Economics, Bengbu 233000
  • Received: 2025-07-05 Online: 2025-09-22

摘要:

[目的/意义] 深度学习模型在自然语言处理(NLP)任务中面临跨领域泛化能力瓶颈,其根源在于不同领域文本所固有的内在复杂度差异。现有研究往往忽视对文本复杂度进行系统性、理论性的量化。 [方法/过程] 基于系统功能语言学理论,构建多维特征文本复杂度计算框架,通过词语级非规范性、句子级结构性和语料级复杂度的非线性交互建模实现精确量化;设计基于知识库的动态自适应CNN-BiLSTM模型,采用双重映射机制,实现“错误记录→知识更新→权重调整→优先预测”的动态学习路径,并融合多尺度CNN、BiLSTM和注意力机制。 [结果/结论] 在4个覆盖不同规范度的公开数据集上的实验表明:1)多维特征复杂度计算框架能够有效揭示各数据集的内在复杂度差异;2)所提模型在所有任务中均取得最优性能,尤其在复杂度最高的waimai数据集上以0.923 8的准确率超越包括大语言模型(LLMs)在内的强基线,展现出卓越的鲁棒性。

关键词: 跨领域情感分析, 文本复杂度量化, 知识库增强, 深度学习, 动态自适应

Abstract:

[Purpose/Significance] In cross-domain natural language processing (NLP) tasks, deep learning models often exhibit performance variations when confronted with texts bearing distinct domain characteristics, which degrades their generalization capability. Text complexity stands out as one of the most explanatory factors behind this degradation. [Method/Process] This paper makes two contributions. First, a multi-dimensional text complexity calculation framework grounded in systemic functional linguistics was constructed. The framework employs a hierarchical quantification approach. At the lexical level, it dynamically identifies four types of non-standard expressions (abbreviations, emoticons, internet buzzwords, and alphanumeric mixed words) and computes a normative score with a non-linear formula. At the sentence level, an inverse fusion enhancement method (IFEM) is proposed, which integrates punctuation anomaly density (weight 0.1), colloquial word ratio (weight 0.4), semantic ambiguity (weight 0.2), and sentence-length features (weight 0.3), generating a structural score by modeling feature synergy and suppression effects together with an adaptive weighting mechanism. Finally, at the corpus level, a weighted fusion outputs the global complexity assessment of the corpus. Experiments show that this framework successfully quantifies intrinsic differences between domain texts: the measured complexity of the waimai_10k dataset reaches 0.703, significantly higher than the 0.552 of the ChnSentiCorp_htl_all dataset, and the framework accurately tracks complexity changes even after internal text reduction and substitution operations. Second, a knowledge base-enhanced dynamic adaptive CNN-BiLSTM model was designed.
This model implements the following mechanisms: 1) the knowledge base adopts a dual mapping architecture of text-label and vector-label, supporting the loading of historical experience and real-time error recording; 2) feature weights are adjusted according to the knowledge base content, for example by strengthening positive semantic representations or weakening negative expressions. The architecture integrates multi-scale CNN convolutional kernels for local feature extraction, a bidirectional long short-term memory network for capturing long-distance dependencies, and an attention mechanism for focusing on key information. To validate the proposed methods, experiments were conducted on four Chinese datasets. [Results/Conclusions] The results indicate that the complexity calculation framework is robust: complexity fluctuates by less than 3.3% after a 20% sample reduction, and increases by at most 13.8% after short-text injection. The framework also effectively quantifies and differentiates text complexities, as evidenced by the 0.703 complexity of waimai_10k versus the 0.552 of ChnSentiCorp_htl_all. Moreover, the proposed model achieves optimal performance on both the most standardized dataset, ChnSentiCorp_htl_all, and the most challenging one, waimai_10k (accuracies of 0.943 4 and 0.923 8, respectively), significantly outperforming Transformer and various large language models such as deepseek-v3 and qwen-plus.
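The sentence-level weighted fusion and the corpus-level aggregation described in the abstract can be sketched in plain Python. Only the four published feature weights (0.1, 0.4, 0.2, 0.3) come from the text; the synergy correction term, the clamping to [0, 1], and the 0.6/0.4 corpus-level split are illustrative assumptions, not the paper's actual formulas.

```python
def sentence_structural_score(punct_anomaly, colloquial_ratio,
                              semantic_ambiguity, length_feature,
                              synergy=0.0):
    """Weighted fusion of the four sentence-level features.

    Weights follow the abstract: punctuation anomaly density 0.1,
    colloquial word ratio 0.4, semantic ambiguity 0.2, sentence
    length 0.3. `synergy` is a hypothetical additive correction
    standing in for the paper's synergy/suppression modeling.
    """
    base = (0.1 * punct_anomaly + 0.4 * colloquial_ratio
            + 0.2 * semantic_ambiguity + 0.3 * length_feature)
    # Clamp to [0, 1] after applying the (assumed additive) correction.
    return min(1.0, max(0.0, base + synergy))


def corpus_complexity(sentence_scores, lexical_scores,
                      w_sent=0.6, w_lex=0.4):
    """Corpus-level score as a weighted fusion of the level averages.

    The 0.6/0.4 split is an assumption; the abstract only states
    that a weighted fusion produces the global score.
    """
    avg = lambda xs: sum(xs) / len(xs)
    return w_sent * avg(sentence_scores) + w_lex * avg(lexical_scores)
```

With all features at 0.5 the structural score is 0.5, and a corpus whose sentence- and lexical-level averages are both 0.7 scores 0.7, matching the scale on which the reported values (0.703 for waimai_10k, 0.552 for ChnSentiCorp_htl_all) are expressed.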
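The dual-mapping knowledge base and the "error record → knowledge update → weight adjustment → priority prediction" loop might be organized as below. This is a minimal sketch: all class and method names are illustrative rather than the paper's API, and the 0.05 bias step is an assumed hyperparameter.

```python
from collections import defaultdict


class DualMappingKB:
    """Minimal sketch of a text-label / vector-label knowledge base."""

    def __init__(self):
        self.text_to_label = {}            # exact-text mapping
        self.vector_to_label = {}          # hashed-vector-key mapping (assumed)
        self.error_counts = defaultdict(int)

    def record_error(self, text, vec_key, true_label):
        """Error recording + knowledge update: store the correction
        under both mappings and count the miss per label."""
        self.text_to_label[text] = true_label
        self.vector_to_label[vec_key] = true_label
        self.error_counts[true_label] += 1

    def lookup(self, text, vec_key):
        """Priority prediction: a knowledge-base hit overrides the
        base model, exact text first, then the vector key."""
        if text in self.text_to_label:
            return self.text_to_label[text]
        return self.vector_to_label.get(vec_key)

    def weight_bias(self, label):
        """Weight adjustment: strengthen labels that were often
        missed. The 0.05 step is an assumed hyperparameter."""
        return 1.0 + 0.05 * self.error_counts[label]
```

A misclassified sample recorded via `record_error` is then answered from the knowledge base on re-encounter, while `weight_bias` supplies a per-label multiplier that the model could use to strengthen or weaken feature representations.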

Key words: cross-domain sentiment analysis, text complexity quantification, knowledge base enhancement, deep learning, dynamic self-adaptation

CLC number: TP391

Cite this article

常郝, 徐涛涛, 李峰. 多维特征文本复杂度框架与知识库增强模型[J/OL]. 农业图书情报学报. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0365.

CHANG Hao, XU Taotao, LI Feng. A Multi-dimensional Feature Text Complexity Framework and Knowledge Base Augmentation Model[J/OL]. Journal of Library and Information Science in Agriculture. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0365.