农业图书情报学报


多维特征文本复杂度框架与知识库增强模型

常郝1, 徐涛涛2, 李峰1   

  1. 安徽财经大学 计算机与信息工程学院,蚌埠 233000
    2. 安徽财经大学 管理科学与工程学院,蚌埠 233000
  • 收稿日期:2025-07-05 出版日期:2025-09-22
  • About the authors:

    CHANG Hao (1983- ), male, Ph.D., professor; research interests: artificial intelligence, 3D integrated circuit testing and fault tolerance

    XU Taotao (2000- ), male, master's student; research interests: natural language processing, artificial intelligence

    LI Feng (1983- ), male, Ph.D., associate professor; research interests: audio source separation, speech emotion recognition, and speech signal processing

  • Funding:
    National Natural Science Foundation of China, "Research on Key Technologies for Through-Silicon Via Testing of 3D Integrated Circuits Based on Delay Features" (61704001); Anhui Provincial Natural Science Foundation, "Research on Reliability and Yield Improvement Methods for 3D Chips Oriented to TSV Delay Testing" (1808085QF196); Natural Science Research Project of Anhui Universities, "Research on Speech Emotion Recognition Based on Multimodal Data Fusion" (2024AH050018)

A Multi-dimensional Feature Text Complexity Framework and Knowledge Base Augmentation Model

CHANG Hao1, XU Taotao2, LI Feng1   

  1. School of Computer and Information Engineering, Anhui University of Finance and Economics, Bengbu 233030
    2. School of Management Science and Engineering, Anhui University of Finance and Economics, Bengbu 233000
  • Received: 2025-07-05 Online: 2025-09-22

摘要:

[目的/意义] 深度学习模型在自然语言处理(NLP)任务中面临跨领域泛化能力瓶颈,其根源在于不同领域文本所固有的内在复杂度差异。现有研究往往忽视对文本复杂度进行系统性、理论性的量化。 [方法/过程] 基于系统功能语言学理论,构建多维特征文本复杂度计算框架,通过词语级非规范性、句子级结构性和语料级复杂度的非线性交互建模实现精确量化;设计基于知识库的动态自适应CNN-BiLSTM模型,采用双重映射机制,实现“错误记录→知识更新→权重调整→优先预测”的动态学习路径,并融合多尺度CNN、BiLSTM和注意力机制。 [结果/结论] 在4个覆盖不同规范度的公开数据集上的实验表明:1)多维特征复杂度计算框架能够有效揭示各数据集的内在复杂度差异;2)所提模型在所有任务中均取得最优性能,尤其在复杂度最高的waimai数据集上以0.923 8的准确率超越包括大语言模型(LLMs)在内的强基线,展现出卓越的鲁棒性。

关键词: 跨领域情感分析, 文本复杂度量化, 知识库增强, 深度学习, 动态自适应

Abstract:

[Purpose/Significance] In cross-domain natural language processing (NLP) tasks, deep learning models often exhibit performance variations when confronted with texts bearing distinct domain characteristics, which degrades their generalization capability. Text complexity stands out as one of the most explanatory factors behind this degradation. [Method/Process] This paper makes two contributions. First, a multi-dimensional text complexity calculation framework grounded in systemic functional linguistics was constructed. The framework employs a hierarchical quantification approach. At the lexical level, it dynamically identifies four types of non-standard expressions (abbreviations, emoticons, internet buzzwords, and alphanumeric mixed words) and computes a normative score with a non-linear formula. At the sentence level, an inverse fusion enhancement method (IFEM) is proposed, which integrates punctuation anomaly density (weight 0.1), colloquial word ratio (weight 0.4), semantic ambiguity (weight 0.2), and sentence-length features (weight 0.3), generating a structural score by modeling feature synergy and suppression effects together with an adaptive weighting mechanism. Finally, at the corpus level, a weighted fusion outputs the global complexity assessment of the corpus. Experiments show that this framework successfully quantifies intrinsic differences between domain texts: the measured complexity of the waimai_10k dataset reaches 0.703, significantly higher than the 0.552 of the ChnSentiCorp_htl_all dataset, and the framework accurately tracks complexity changes even after internal text reduction and substitution operations. Second, a knowledge base-enhanced dynamic adaptive CNN-BiLSTM model was designed.
This model implements the following mechanisms: 1) the knowledge base adopts a dual mapping architecture of text-label and vector-label, supporting the loading of historical experience and real-time error recording; 2) feature weights are adjusted according to the knowledge base content, for example by strengthening positive semantic representations or weakening negative expressions. The architecture integrates multi-scale CNN convolutional kernels for local feature extraction, a bidirectional long short-term memory network for capturing long-distance dependencies, and an attention mechanism for focusing on key information. To validate the proposed methods, experiments were conducted on four Chinese datasets. [Results/Conclusions] The results indicate that the complexity calculation framework is robust: complexity fluctuates by less than 3.3% after a 20% sample reduction, and increases by at most 13.8% after short-text injection. The framework also effectively quantifies and differentiates text complexities, as evidenced by the 0.703 complexity of waimai_10k versus the 0.552 of ChnSentiCorp_htl_all. Moreover, the proposed model achieves optimal performance on both the most standardized dataset, ChnSentiCorp_htl_all, and the most challenging one, waimai_10k (accuracies of 0.943 4 and 0.923 8, respectively), significantly outperforming Transformer and various large language models such as deepseek-v3 and qwen-plus.
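The sentence-level weighted fusion and the corpus-level aggregation described in the abstract can be sketched in plain Python. Only the four published feature weights (0.1, 0.4, 0.2, 0.3) come from the text; the synergy correction term, the clamping to [0, 1], and the 0.6/0.4 corpus-level split are illustrative assumptions, not the paper's actual formulas.

```python
def sentence_structural_score(punct_anomaly, colloquial_ratio,
                              semantic_ambiguity, length_feature,
                              synergy=0.0):
    """Weighted fusion of the four sentence-level features.

    Weights follow the abstract: punctuation anomaly density 0.1,
    colloquial word ratio 0.4, semantic ambiguity 0.2, sentence
    length 0.3. `synergy` is a hypothetical additive correction
    standing in for the paper's synergy/suppression modeling.
    """
    base = (0.1 * punct_anomaly + 0.4 * colloquial_ratio
            + 0.2 * semantic_ambiguity + 0.3 * length_feature)
    # Clamp to [0, 1] after applying the (assumed additive) correction.
    return min(1.0, max(0.0, base + synergy))


def corpus_complexity(sentence_scores, lexical_scores,
                      w_sent=0.6, w_lex=0.4):
    """Corpus-level score as a weighted fusion of the level averages.

    The 0.6/0.4 split is an assumption; the abstract only states
    that a weighted fusion produces the global score.
    """
    avg = lambda xs: sum(xs) / len(xs)
    return w_sent * avg(sentence_scores) + w_lex * avg(lexical_scores)
```

With all features at 0.5 the structural score is 0.5, and a corpus whose sentence- and lexical-level averages are both 0.7 scores 0.7, matching the scale on which the reported values (0.703 for waimai_10k, 0.552 for ChnSentiCorp_htl_all) are expressed.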
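The dual-mapping knowledge base and the "error record → knowledge update → weight adjustment → priority prediction" loop might be organized as below. This is a minimal sketch: all class and method names are illustrative rather than the paper's API, and the 0.05 bias step is an assumed hyperparameter.

```python
from collections import defaultdict


class DualMappingKB:
    """Minimal sketch of a text-label / vector-label knowledge base."""

    def __init__(self):
        self.text_to_label = {}            # exact-text mapping
        self.vector_to_label = {}          # hashed-vector-key mapping (assumed)
        self.error_counts = defaultdict(int)

    def record_error(self, text, vec_key, true_label):
        """Error recording + knowledge update: store the correction
        under both mappings and count the miss per label."""
        self.text_to_label[text] = true_label
        self.vector_to_label[vec_key] = true_label
        self.error_counts[true_label] += 1

    def lookup(self, text, vec_key):
        """Priority prediction: a knowledge-base hit overrides the
        base model, exact text first, then the vector key."""
        if text in self.text_to_label:
            return self.text_to_label[text]
        return self.vector_to_label.get(vec_key)

    def weight_bias(self, label):
        """Weight adjustment: strengthen labels that were often
        missed. The 0.05 step is an assumed hyperparameter."""
        return 1.0 + 0.05 * self.error_counts[label]
```

A misclassified sample recorded via `record_error` is then answered from the knowledge base on re-encounter, while `weight_bias` supplies a per-label multiplier that the model could use to strengthen or weaken feature representations.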

Key words: cross-domain sentiment analysis, text complexity quantification, knowledge base enhancement, deep learning, dynamic self-adaptation

CLC number: TP391

Cite this article

常郝, 徐涛涛, 李峰. 多维特征文本复杂度框架与知识库增强模型[J/OL]. 农业图书情报学报. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0365.

CHANG Hao, XU Taotao, LI Feng. A Multi-dimensional Feature Text Complexity Framework and Knowledge Base Augmentation Model[J/OL]. Journal of Library and Information Science in Agriculture. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0365.