Optimizing a Disruptive Technology Identification Model Based on Data Balancing and Ensemble Learning

doi:10.13998/j.cnki.issn1002-1248.25-0536

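Abstract

Abstract (translated from the Chinese):

[Purpose/Significance] Existing methods for identifying disruptive technologies suffer from imbalanced class distributions and from the limited performance of a single model. Building on a reproduction of an existing identification model that uses a single XGBoost classifier, this study proposes an optimized model based on data balancing and ensemble learning. [Method/Process] A hybrid SMOTE-ENN sampling strategy is used to reconstruct the training set, retaining representative minority-class samples while removing noisy data, which mitigates the interference of class imbalance with model training. A Stacking ensemble architecture is then built from multiple base learners, including XGBoost, LightGBM, Extra Trees, and SVM, with a random forest as the meta-learner to combine complementary features and integrate performance, improving the model's overall identification ability. [Results/Conclusions] Experimental results show that the optimized model improves substantially on the original model in Accuracy, Precision, Recall, and F1, with F1 rising from 0.63 to 0.98, indicating that the method adapts well and remains stable under high-dimensional noise and sample imbalance. The proposed optimization scheme not only improves the accuracy and stability of disruptive technology identification but also offers a transferable approach for modeling technical text in imbalanced-data settings.

Keywords: data balancing, ensemble learning, disruptive technology identification, model performance optimization, machine learning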

Abstract:

[Purpose/Significance] Disruptive technology identification has become an increasingly important research topic amid rapid technological evolution and the strategic decision-making needs of governments and enterprises. However, existing data-driven identification approaches suffer from two critical limitations. First, disruptive technology datasets are typically severely class-imbalanced: truly disruptive cases make up only a small fraction of the samples, which leads to biased learning and poor generalization. Second, most existing studies rely on a single machine learning model, which limits the ability to capture the complex, heterogeneous patterns embedded in high-dimensional technical text features. These issues restrict the robustness, accuracy, and practical applicability of current identification frameworks. To address them, this study constructs an optimized disruptive technology identification model that jointly addresses data imbalance and model performance, improving the reliability and stability of predictions and contributing to methods in technology intelligence and innovation management research.

[Method/Process] Starting from a reproduction of a widely used baseline model built on XGBoost, this study proposes a two-stage optimization framework that integrates data resampling and ensemble learning. In the data preprocessing stage, a hybrid SMOTE-ENN sampling strategy reconstructs the training dataset. The SMOTE component synthetically generates minority-class samples to strengthen class representation, while the ENN component removes ambiguous and noisy samples from overlapping regions, balancing noise reduction against information preservation. This strategy alleviates the adverse impact of class imbalance on model learning without excessively distorting the original data distribution. In the modeling stage, a stacking ensemble is constructed from multiple heterogeneous base learners, including XGBoost, LightGBM, Extra Trees, and a Support Vector Machine (SVM), selected to capture complementary decision boundaries and feature interactions. A Random Forest model serves as the meta-learner, aggregating the outputs of the base learners and performing higher-level feature integration. Through this hierarchical learning mechanism, the framework improves both representation capability and predictive robustness, enabling more accurate identification of disruptive technologies under complex and noisy data conditions.

[Results/Conclusions] Experimental evaluations show that the optimized model substantially outperforms the baseline XGBoost model on Accuracy, Precision, Recall, and F1-score. Notably, the F1-score improves from 0.63 to 0.98, indicating a marked gain in the model's ability to correctly identify minority disruptive technology samples while maintaining high overall stability. The results indicate that combining hybrid resampling with ensemble learning effectively addresses sample imbalance and model bias in disruptive technology identification tasks. The proposed framework offers a robust and scalable approach to identifying disruptive technologies in high-dimensional, imbalanced data scenarios. Beyond improving prediction accuracy, the study offers methodological insights for technical text modeling and innovation analytics, and the approach can be adapted to other fields with similar data imbalance and complexity issues. Future research may explore adaptive sampling strategies and deep learning-based ensemble architectures to enhance temporal and semantic representation capabilities.

Keywords: data balancing, ensemble learning, disruptive technology identification, model performance optimization, machine learning

CLC number: G305, TP18

CHEN Yuanyuan, HU Shaohuang, CHEN Xiaohong. Optimization of Subversive Technology Identification Model Based on Data Balancing and Integrated Learning[J]. Journal of Library and Information Science in Agriculture, 2026, 38(6): 86-97.

Class	Accuracy	Precision	Recall	F1
0 (non-disruptive)	0.80	0.88	0.89	0.89
1 (disruptive)		0.21	0.19	0.20
macro avg		0.55	0.54	0.54
weighted avg		0.80	0.80	0.80
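
The gap between the macro and weighted averages in the table above follows directly from how the two are defined. The sketch below is a hedged arithmetic check, not the authors' evaluation code: it recomputes per-class F1 and the macro averages from the reported precision and recall, and then estimates the share of class 0 implied by the weighted recall. The helper names (`f1`, `macro_average`, `implied_share`) are illustrative, and because the table values are rounded to two decimals the implied share is only approximate.

```python
# Per-class precision and recall as reported in the table above.
reported = {
    "0 (non-disruptive)": {"precision": 0.88, "recall": 0.89},
    "1 (disruptive)": {"precision": 0.21, "recall": 0.19},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(values):
    """Unweighted mean: each class counts equally, whatever its size."""
    values = list(values)
    return sum(values) / len(values)


def implied_share(weighted_avg, value_a, value_b):
    """Solve weighted_avg = w * value_a + (1 - w) * value_b for w."""
    return (weighted_avg - value_b) / (value_a - value_b)


f1_by_class = {name: f1(m["precision"], m["recall"]) for name, m in reported.items()}
macro_precision = macro_average(m["precision"] for m in reported.values())
macro_recall = macro_average(m["recall"] for m in reported.values())
macro_f1 = macro_average(f1_by_class.values())

# Rough estimate only: the inputs are rounded to two decimals in the table.
negative_share = implied_share(0.80, 0.89, 0.19)

for name, value in f1_by_class.items():
    print(f"F1 for class {name}: {value:.4f}")
print(f"macro precision={macro_precision:.3f}, recall={macro_recall:.3f}, F1={macro_f1:.3f}")
print(f"implied share of class 0: {negative_share:.2f}")
```

Because the weighted average is dominated by the much larger class 0, it stays near 0.80 even though the disruptive class scores about 0.20. The macro average exposes that weakness.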

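Method	Accuracy	Negative-class precision	Negative-class recall	Negative-class F1	Positive-class precision	Positive-class recall	Positive-class F1
Original method	0.63	0.90	0.65	0.76	0.18	0.52	0.26
With SMOTE	0.85	0.86	0.83	0.85	0.84	0.87	0.85
With SMOTE-ENN	0.95	0.92	0.96	0.94	0.95	0.95	0.95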
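
The two resampling steps compared in the table above can be illustrated with a small standard-library sketch. It is a toy version under stated assumptions, not the authors' pipeline: the data are hypothetical two-dimensional points, `k=3` and the majority-vote editing rule are illustrative choices, and the function names are made up for this example. In practice a library implementation such as `SMOTEENN` from imbalanced-learn would normally be used, and resampling would be applied to the training split only.

```python
import math
import random
from collections import Counter


def smote(samples, minority_label, n_new, k=3, seed=0):
    """Append n_new synthetic minority points, each interpolated between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    minority = [x for x, y in samples if y == minority_label]
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = minority[:i] + minority[i + 1:]
        neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
        target = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> target
        point = tuple(b + gap * (t - b) for b, t in zip(base, target))
        synthetic.append((point, minority_label))
    return samples + synthetic


def edited_nearest_neighbours(samples, k=3):
    """Keep only samples whose label matches the majority vote of their
    k nearest neighbours (Wilson-style editing, applied to every class)."""
    kept = []
    for i, (x, y) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        neighbours = sorted(others, key=lambda s: math.dist(x, s[0]))[:k]
        votes = Counter(label for _, label in neighbours)
        if votes.most_common(1)[0][0] == y:
            kept.append((x, y))
    return kept


# Hypothetical toy training set: 40 majority points (label 0), 8 minority (label 1).
rng = random.Random(42)
majority = [((rng.gauss(0, 1), rng.gauss(0, 1)), 0) for _ in range(40)]
minority = [((rng.gauss(1.5, 1), rng.gauss(1.5, 1)), 1) for _ in range(8)]
train = majority + minority

# Step 1: oversample the minority class until both classes have equal size.
balanced = smote(train, minority_label=1, n_new=len(majority) - len(minority))
# Step 2: drop points that disagree with their neighbourhood (overlap/noise).
cleaned = edited_nearest_neighbours(balanced)

print("before:", Counter(y for _, y in train))
print("after SMOTE:", Counter(y for _, y in balanced))
print("after SMOTE + ENN:", Counter(y for _, y in cleaned))
```

SMOTE only adds points, so on its own it can amplify noise in regions where the classes overlap. The ENN pass removes such points from both classes, which is the noise-cleaning effect the SMOTE-ENN row is meant to capture.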

Model	Accuracy	Recall	Precision	F-score	Area under ROC curve	Area under P curve
xgb	0.945361	0.936283	0.968864	0.952295	0.947154	0.944245
lgb	0.952577	0.953982	0.964222	0.959075	0.952300	0.946655
rfb	0.946392	0.962832	0.946087	0.954386	0.943144	0.932572
extra	0.978351	0.976991	0.985714	0.981333	0.978619	0.976436
svm	0.849485	0.890265	0.856899	0.873264	0.841429	0.826786
knn	0.940206	0.907965	0.988439	0.946494	0.946575	0.951076
mlp	0.868041	0.874336	0.896552	0.885305	0.866798	0.857084
cat	0.952577	0.962832	0.956063	0.959436	0.950552	0.942178
stack	0.978351	0.968142	0.994545	0.981166	0.980367	0.981418
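
The `stack` row above corresponds to a stacking ensemble with a random forest meta-learner. A minimal scikit-learn sketch of that arrangement follows, with several assumptions labeled: the data are synthetic (`make_classification`), `GradientBoostingClassifier` stands in for XGBoost and LightGBM so the example needs only scikit-learn, and all hyperparameters, including the 5-fold `cv`, are illustrative because the source does not state them. It will not reproduce the numbers in the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical synthetic data standing in for the paper's technical-text features.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Heterogeneous base learners (stand-ins; hyperparameters are assumptions).
base_learners = [
    ("gbdt", GradientBoostingClassifier(random_state=0)),
    ("extra", ExtraTreesClassifier(n_estimators=100, random_state=0)),
    ("svm", make_pipeline(StandardScaler(), SVC())),
]

# The meta-learner is trained on out-of-fold predictions of the base learners.
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=RandomForestClassifier(n_estimators=100, random_state=0),
    cv=5,
)
stack.fit(X_train, y_train)

score = f1_score(y_test, stack.predict(X_test))
print(f"stacked F1 on the toy split: {score:.3f}")
```

Training the meta-learner on cross-validated predictions, not on predictions for data the base learners were fit on, keeps the second level from simply learning the base models' training-set overfit.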