
Journal of library and information science in agriculture

   

Optimization of Subversive Technology Identification Model Based on Data Balancing and Integrated Learning

CHEN Yuanyuan1, HU Shaohuang2   

  1. Shanghai Publishing and Media Research Institute, Shanghai Publishing and Printing College, Shanghai 200093
  2. School of Computer Science and Technology, Xinjiang Normal University, Urumqi 830054
  • Received: 2025-12-09 Online: 2026-03-04

Abstract:

[Purpose/Significance] Identifying disruptive technologies has become an important research topic as technology evolves rapidly and governments and enterprises make strategic decisions around it. Existing data-driven identification approaches, however, have two critical limitations. First, disruptive technology datasets are severely class-imbalanced: truly disruptive cases make up only a small fraction of the samples, which biases learning and hurts generalization. Second, most studies rely on a single machine learning model, which limits their ability to capture the complex, heterogeneous patterns in high-dimensional technical text features. These issues restrict the robustness, accuracy, and practical applicability of current identification frameworks. This study therefore constructs an optimized disruptive technology identification model that addresses class imbalance and model performance together, with the aim of improving the reliability and stability of predictions and advancing methods in technology intelligence and innovation management research.

[Method/Process] Starting from a reproduction of a widely used XGBoost baseline, this study proposes a two-stage optimization framework that combines data resampling with ensemble learning. In the data preprocessing stage, a hybrid SMOTE-ENN sampling strategy reconstructs the training dataset. The SMOTE component synthesizes minority-class samples to strengthen class representation, while the ENN component removes ambiguous and noisy samples from overlapping regions, balancing noise reduction against information preservation. This strategy alleviates the adverse impact of class imbalance on model learning without excessively distorting the original data distribution. In the modeling stage, a stacking ensemble integrates multiple heterogeneous base learners, including XGBoost, LightGBM, Extra Trees, and Support Vector Machines, selected to capture complementary decision boundaries and feature interactions. A Random Forest model serves as the meta-learner, aggregating the outputs of the base learners and performing higher-level feature integration. This hierarchical learning mechanism improves both representation capability and predictive robustness, enabling more accurate identification of disruptive technologies under complex and noisy data conditions.

[Results/Conclusions] Experiments show that the optimized model outperforms the baseline XGBoost model on Accuracy, Precision, Recall, and F1-Score. Notably, the F1-Score rises from 0.63 to 0.98, indicating a marked improvement in the model's ability to correctly identify minority disruptive technology samples while maintaining overall stability. The results confirm that combining hybrid resampling with ensemble learning effectively addresses sample imbalance and model bias in disruptive technology identification. The proposed framework thus provides a robust and scalable solution for identifying disruptive technologies in high-dimensional, imbalanced data. Beyond improving prediction accuracy, the study offers methodological insights for technical text modeling and innovation analytics, and the approach can be adapted to other fields with similar data imbalance and complexity issues. Future research may explore adaptive sampling strategies and deep learning-based ensemble architectures to enhance temporal and semantic representation.

Key words: data balancing, ensemble learning, disruptive technology identification, model performance optimization, machine learning

CLC Number: G305

Fig. 1 Experimental framework diagram

Fig. 2 SMOTE-ENN algorithm flow chart
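The SMOTE-ENN resampling step described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`nearest`, `smote`, `enn`, `smote_enn`), the neighbour count `k=3`, the binary 0/1 labels, and the toy 2-D data are all hypothetical. SMOTE here interpolates between a minority sample and one of its minority-class neighbours until the classes are the same size; ENN then applies Wilson's editing rule and drops any sample whose label disagrees with the majority of its neighbours.

```python
import math
import random
from collections import Counter

def nearest(point, pool, k):
    """Indices of the k points in `pool` closest to `point` (Euclidean)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]

def smote(X, y, minority=1, k=3, seed=0):
    """Oversample the minority class of a binary problem to parity."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_needed = (len(y) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_pts)
        # Ask for k + 1 neighbours: the closest one is `base` itself.
        neighbours = nearest(base, minority_pts, k + 1)[1:]
        other = minority_pts[rng.choice(neighbours)]
        gap = rng.random()
        # Synthetic sample: a random point on the segment base -> other.
        X_new.append([b + gap * (o - b) for b, o in zip(base, other)])
        y_new.append(minority)
    return X_new, y_new

def enn(X, y, k=3):
    """Drop samples whose label disagrees with most of their k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        idx = [j for j in nearest(x, X, k + 1) if j != i][:k]
        majority = Counter(y[j] for j in idx).most_common(1)[0][0]
        if majority == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

def smote_enn(X, y, minority=1, k=3, seed=0):
    """Oversample with SMOTE, then clean the result with ENN."""
    X_over, y_over = smote(X, y, minority, k, seed)
    return enn(X_over, y_over, k)

# Hypothetical toy data: 40 majority samples (class 0), 8 minority (class 1).
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(2.5, 1), rng.gauss(2.5, 1)] for _ in range(8)]
y = [0] * 40 + [1] * 8

X_res, y_res = smote_enn(X, y)
print(Counter(y), "->", Counter(y_res))
```

Resampling should be applied to the training split only, so that the evaluation data keeps its original class distribution. In practice a maintained implementation such as `SMOTEENN` from the imbalanced-learn package would normally be preferred over a hand-written one.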

Fig. 3 Stacking framework diagram
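The stacking mechanism described in the abstract, in which a meta-learner is trained on the outputs of the base learners, can be sketched as follows. This is a simplified illustration, not the paper's model: two toy base learners (`Centroid`, `KNN`) stand in for XGBoost, LightGBM, Extra Trees, and SVM, and a small logistic regression (`Logistic`) stands in for the Random Forest meta-learner. All class and function names, the five-fold split, and the synthetic data are assumptions made for the example. The point it shows is the out-of-fold step: each meta-feature is predicted by a base model that never saw that sample during training.

```python
import math
import random

class Centroid:
    """Toy base learner: score by relative distance to the class centroids."""
    def fit(self, X, y):
        self.c = {}
        for label in (0, 1):
            pts = [x for x, t in zip(X, y) if t == label]
            self.c[label] = [sum(col) / len(pts) for col in zip(*pts)]
        return self
    def predict_proba(self, X):
        out = []
        for x in X:
            d0, d1 = math.dist(x, self.c[0]), math.dist(x, self.c[1])
            out.append(d0 / (d0 + d1) if d0 + d1 else 0.5)
        return out

class KNN:
    """Toy base learner: share of positive labels among the k nearest points."""
    def __init__(self, k=3):
        self.k = k
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict_proba(self, X):
        out = []
        for x in X:
            order = sorted(range(len(self.X)), key=lambda i: math.dist(x, self.X[i]))
            out.append(sum(self.y[i] for i in order[:self.k]) / self.k)
        return out

class Logistic:
    """Stand-in meta-learner: logistic regression fitted by gradient descent."""
    def fit(self, Z, y, lr=0.1, epochs=200):
        self.w, self.b = [0.0] * len(Z[0]), 0.0
        for _ in range(epochs):
            for z, t in zip(Z, y):
                err = self._p(z) - t
                self.w = [w - lr * err * v for w, v in zip(self.w, z)]
                self.b -= lr * err
        return self
    def _p(self, z):
        return 1 / (1 + math.exp(-(sum(w * v for w, v in zip(self.w, z)) + self.b)))
    def predict(self, Z):
        return [int(self._p(z) >= 0.5) for z in Z]

def stack_fit(X, y, make_bases, meta, n_folds=5):
    """Fit the meta-learner on out-of-fold base predictions, then refit bases."""
    n = len(X)
    Z = [[0.0] * len(make_bases) for _ in range(n)]
    for fold in range(n_folds):
        test = [i for i in range(n) if i % n_folds == fold]
        train = [i for i in range(n) if i % n_folds != fold]
        for j, make in enumerate(make_bases):
            model = make().fit([X[i] for i in train], [y[i] for i in train])
            for i, p in zip(test, model.predict_proba([X[i] for i in test])):
                Z[i][j] = p  # predicted without having seen sample i
    meta.fit(Z, y)
    bases = [make().fit(X, y) for make in make_bases]  # final fit on all data
    return bases, meta

def stack_predict(bases, meta, X):
    Z = list(zip(*(b.predict_proba(X) for b in bases)))
    return meta.predict(Z)

# Hypothetical balanced toy data, shuffled so every fold contains both classes.
rng = random.Random(0)
data = [([rng.gauss(0, 1), rng.gauss(0, 1)], 0) for _ in range(60)]
data += [([rng.gauss(2, 1), rng.gauss(2, 1)], 1) for _ in range(60)]
rng.shuffle(data)
X, y = [d[0] for d in data], [d[1] for d in data]

bases, meta = stack_fit(X[:90], y[:90], [Centroid, KNN], Logistic())
pred = stack_predict(bases, meta, X[90:])
accuracy = sum(p == t for p, t in zip(pred, y[90:])) / len(pred)
print(f"held-out accuracy: {accuracy:.2f}")
```

Training the meta-learner on in-sample base predictions instead would let it reward base models that merely memorize the training set, which is why the out-of-fold construction matters. With real models, scikit-learn's `StackingClassifier` packages the same procedure.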

Table 1 Reproduction results of the original scheme (threshold = 0.22)

Class                 Accuracy  Precision  Recall  F1
0 (non-disruptive)    0.80      0.88       0.89    0.89
1 (disruptive)                  0.21       0.19    0.20
macro avg                       0.55       0.54    0.54
weighted avg                    0.80       0.80    0.80
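To make the role of the threshold in Table 1 concrete, the sketch below computes per-class precision, recall, and F1 and their macro averages at a chosen probability cut-off. It assumes that the threshold is applied to the predicted probability of the disruptive class (a sample is labelled 1 when its probability is at least the threshold); the function name `per_class_report` and the eight probabilities are hypothetical and are not data from the paper.

```python
def per_class_report(y_true, proba, threshold=0.22):
    """Per-class (precision, recall, F1) and macro averages at a cut-off."""
    y_pred = [int(p >= threshold) for p in proba]
    report = {}
    for cls in (0, 1):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[cls] = (precision, recall, f1)
    # Macro average: unweighted mean over the two classes.
    report["macro avg"] = tuple(
        (report[0][i] + report[1][i]) / 2 for i in range(3)
    )
    return report

# Hypothetical scores: six non-disruptive samples, two disruptive ones.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.05, 0.10, 0.15, 0.30, 0.18, 0.02, 0.25, 0.12]

low = per_class_report(y_true, proba, threshold=0.22)
default = per_class_report(y_true, proba, threshold=0.5)
print("threshold 0.22, class 1:", low[1])
print("threshold 0.50, class 1:", default[1])
```

On imbalanced data a classifier rarely assigns a minority sample a probability above 0.5, so lowering the cut-off trades some majority-class precision for minority-class recall. The macro averages in Table 1 follow the same rule as in the sketch, for example (0.88 + 0.21) / 2 ≈ 0.55 for precision.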

Fig. 4 Comparison flow chart of sampling methods

Table 2 Training results of the SMOTE and SMOTE-ENN methods

Method            Accuracy  Neg. precision  Neg. recall  Neg. F1  Pos. precision  Pos. recall  Pos. F1
Original method   0.63      0.90            0.65         0.76     0.18            0.52         0.26
With SMOTE        0.85      0.86            0.83         0.85     0.84            0.87         0.85
With SMOTE-ENN    0.95      0.92            0.96         0.94     0.95            0.95         0.95

Fig. 5 Stacking training flow chart

Fig. 6 ROC and PRC curves
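For readers unfamiliar with the two curve summaries, the sketch below shows one common way to reduce each curve to a single number. It is an illustration only: the paper does not state which estimator it used, the function names `roc_auc` and `average_precision` and the six scores are hypothetical, and average precision is just one of several ways to summarize a precision-recall curve. The ROC area is computed as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one, with ties counted as half.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise positive/negative comparisons."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / hits

# Hypothetical scores for three negative and three positive samples.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(f"ROC AUC = {auc:.3f}, average precision = {ap:.3f}")
```

The ROC area is insensitive to class proportions, whereas the precision-recall summary depends on how rare the positive class is, which is why both are usually reported for imbalanced problems.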

Table 3 Comparison of model training results

Model  Accuracy  Recall    Precision  F-score   ROC AUC   PRC AUC
xgb    0.945361  0.936283  0.968864   0.952295  0.947154  0.944245
lgb    0.952577  0.953982  0.964222   0.959075  0.952300  0.946655
rfb    0.946392  0.962832  0.946087   0.954386  0.943144  0.932572
extra  0.978351  0.976991  0.985714   0.981333  0.978619  0.976436
svm    0.849485  0.890265  0.856899   0.873264  0.841429  0.826786
knn    0.940206  0.907965  0.988439   0.946494  0.946575  0.951076
mlp    0.868041  0.874336  0.896552   0.885305  0.866798  0.857084
cat    0.952577  0.962832  0.956063   0.959436  0.950552  0.942178
stack  0.978351  0.968142  0.994545   0.981166  0.980367  0.981418
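The F-score column in Table 3 can be checked against the recall and precision columns, since F1 is the harmonic mean of the two. The short sketch below does this for three rows copied from the table; the names `f_score` and `rows` are introduced here for illustration only.

```python
def f_score(precision, recall):
    """F1: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (recall, precision, reported F-score) copied from Table 3.
rows = {
    "xgb": (0.936283, 0.968864, 0.952295),
    "extra": (0.976991, 0.985714, 0.981333),
    "stack": (0.968142, 0.994545, 0.981166),
}

for name, (recall, precision, reported) in rows.items():
    computed = f_score(precision, recall)
    print(f"{name}: computed {computed:.6f}, reported {reported:.6f}")
```

The check also makes the trade-off between the two strongest models visible: `stack` has the higher precision and `extra` the higher recall, and their F-scores are nearly identical.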

Fig. 7 Ablation experiment results
