Journal of Library and Information Science in Agriculture


Optimizing a Disruptive Technology Identification Model Based on Data Balancing and Ensemble Learning

CHEN Yuanyuan1, HU Shaohuang2, CHEN Xiaohong1

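  1. Shanghai Publishing and Media Research Institute, Shanghai Publishing and Printing College, Shanghai 200093
  2. School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054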
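  • Received: 2025-12-09  Published: 2026-03-04
  • About the authors: CHEN Yuanyuan (b. 1977), female, PhD, associate professor; research interests: think tank evaluation and management, technology identification and transfer
    HU Shaohuang (b. 1999), male, master's student; research interest: disruptive technology identification
    CHEN Xiaohong (b. 1980), female, associate professor; research interests: digital technology applications, technology identification
  • Funding:
    Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning (TP2022126)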

Optimization of Subversive Technology Identification Model Based on Data Balancing and Integrated Learning

CHEN Yuanyuan1, HU Shaohuang2   

  1. Shanghai Publishing and Media Research Institute, Shanghai Publishing and Printing College, Shanghai 200093
  2. School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054
  • Received: 2025-12-09  Online: 2026-03-04

Abstract:

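[Purpose/Significance] Existing methods for identifying disruptive technologies suffer from imbalanced class distributions and the limited performance of single models. After reproducing an existing identification model (which uses a single XGBoost classifier), this study proposes an optimized model based on data balancing and ensemble learning. [Method/Process] A hybrid SMOTE-ENN sampling strategy is used to reconstruct the training set, retaining representative minority-class samples while removing noisy data, which mitigates the interference of class imbalance with model training. A Stacking ensemble architecture is then built from multiple base learners, including XGBoost, LightGBM, Extra Trees, and SVM, with a random forest as the meta-learner to achieve feature complementarity and performance integration, improving the model's overall identification capability. [Results/Conclusions] Experiments show that the optimized model improves substantially on the original model in Accuracy, Precision, Recall, and F1; in particular, the F1 score rises from 0.63 to 0.98, indicating that the method adapts well and remains stable under high-dimensional noise and sample imbalance. The proposed optimization scheme not only improves the accuracy and stability of disruptive technology identification but also offers a reference approach for modeling technical text in imbalanced-data scenarios.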

Key words: data balancing, ensemble learning, disruptive technology identification, model performance optimization, machine learning

Abstract:

[Purpose/Significance] Disruptive technology identification has become an increasingly important research topic in the context of rapid technological evolution and strategic decision-making for governments and enterprises. However, existing data-driven identification approaches often suffer from two critical limitations. First, disruptive technology datasets are typically characterized by severe class imbalance, where truly disruptive cases constitute only a small fraction of the total samples, leading to biased learning and poor generalization. Second, most existing studies rely on a single machine learning model, which limits the ability to capture complex and heterogeneous patterns embedded in high-dimensional technical text features. These issues restrict the robustness, accuracy, and practical applicability of current identification frameworks. To address these challenges, this study aims to construct an optimized disruptive technology identification model that jointly considers data imbalance mitigation and model performance enhancement, thereby improving the reliability and stability of predictive results and contributing to methodological advancements in technology intelligence and innovation management research. [Method/Process] Based on the reproduction of a widely used baseline model built upon XGBoost, this study proposed a two-stage optimization framework integrating data resampling and ensemble learning. In the data preprocessing stage, a hybrid SMOTE-ENN sampling strategy was employed to reconstruct the training dataset. The SMOTE component synthetically generated minority class samples to enhance class representation, while the ENN component removed ambiguous and noisy samples from overlapping regions, thus achieving a balance between noise reduction and information preservation. This strategy effectively alleviated the adverse impact of class imbalance on model learning without excessively distorting the original data distribution. 
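The SMOTE-ENN step described above can be illustrated with a minimal pure-Python sketch. It assumes a binary problem, Euclidean distance, and the majority-vote form of ENN. The helper names (`smote`, `edited_nearest_neighbours`), the toy points, and the values of `k` are illustrative assumptions, not the paper's data or implementation. In practice a library implementation such as imbalanced-learn's `SMOTEENN` would normally be used.

```python
import math
import random
from collections import Counter


def smote(X, y, minority, k=2, seed=0):
    """Add synthetic minority points until both classes are the same size (binary case)."""
    rng = random.Random(seed)
    pts = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(y) - len(pts)) - len(pts)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        base = rng.choice(pts)
        # Sort minority points by distance to `base`; position 0 is `base` itself (distance 0).
        ranked = sorted(pts, key=lambda p: math.dist(base, p))
        neighbour = rng.choice(ranked[1:k + 1])
        gap = rng.random()  # interpolation weight in [0, 1)
        X_out.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
        y_out.append(minority)
    return X_out, y_out


def edited_nearest_neighbours(X, y, k=3):
    """Drop every sample whose label disagrees with the majority vote of its k nearest neighbours."""
    keep_X, keep_y = [], []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x, X[j]))[:k]
        vote = Counter(y[j] for j in nearest).most_common(1)[0][0]
        if vote == label:
            keep_X.append(x)
            keep_y.append(label)
    return keep_X, keep_y


# Toy 2-D data (hypothetical): 9 majority-labelled points, 3 minority points.
majority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2),
            (0.4, 0.4), (0.2, 0.5), (0.5, 0.1), (0.0, 0.4)]
noisy = [(2.1, 2.0)]  # labelled as majority but sitting inside the minority cluster
minority = [(2.0, 2.0), (2.2, 2.1), (2.1, 1.8)]
X = majority + noisy + minority
y = [0] * 9 + [1] * 3

X_bal, y_bal = smote(X, y, minority=1, k=2)
X_clean, y_clean = edited_nearest_neighbours(X_bal, y_bal, k=3)
print("after SMOTE:", sorted(Counter(y_bal).items()))    # [(0, 9), (1, 9)]
print("after ENN:  ", sorted(Counter(y_clean).items()))  # [(0, 8), (1, 9)]
```

In this toy run SMOTE first evens out the class counts, and ENN then removes the majority-labelled point that sits inside the minority cluster. As in the abstract, resampling of this kind would be applied to the training set only, so that evaluation data keep their original distribution.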
In the modeling stage, a stacking-based ensemble learning framework was constructed by integrating multiple heterogeneous base learners, including XGBoost, LightGBM, Extra Trees, and Support Vector Machines. These base models were selected to capture complementary decision boundaries and feature interactions from different learning perspectives. A Random Forest model was further employed as a meta-learner to aggregate the outputs of the base learners and perform higher-level feature integration. Through this hierarchical learning mechanism, the proposed framework enhanced both representation capability and predictive robustness, enabling more accurate identification of disruptive technologies under complex and noisy data conditions. [Results/Conclusions] Extensive experimental evaluations demonstrate that the proposed optimized model significantly outperforms the baseline XGBoost model across multiple core performance metrics, including Accuracy, Precision, Recall, and F1-Score. Notably, the F1-Score improves from 0.63 to 0.98, indicating a marked enhancement in the model's ability to correctly identify minority disruptive technology samples while maintaining high overall stability. The results confirm that the combined application of hybrid resampling and ensemble learning effectively addresses the challenges of sample imbalance and model bias in disruptive technology identification tasks. In conclusion, the proposed framework provides a robust and scalable solution for identifying disruptive technologies in high-dimensional, imbalanced data scenarios. Beyond improving prediction accuracy, this study offers methodological insights for technical text modeling and innovation analytics. The approach can be adapted to other fields with similar data imbalance and complexity issues.
Future research may further explore adaptive sampling strategies and deep learning-based ensemble architectures to enhance temporal and semantic representation capabilities.

Key words: data balance, integrated learning, subversive technology identification, model performance optimization, machine learning

CLC number: G305

Cite this article

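CHEN Yuanyuan, HU Shaohuang, CHEN Xiaohong. Optimizing a Disruptive Technology Identification Model Based on Data Balancing and Ensemble Learning[J/OL]. Journal of Library and Information Science in Agriculture. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0536.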

CHEN Yuanyuan, HU Shaohuang. Optimization of Subversive Technology Identification Model Based on Data Balancing and Integrated Learning[J/OL]. Journal of Library and Information Science in Agriculture. https://doi.org/10.13998/j.cnki.issn1002-1248.25-0536.