Copyright Data Dilemma of Building High-Quality Data System for AI: Present Situation, Coping Strategies, and Implementation Path

doi:10.13998/j.cnki.issn1002-1248.24-0475

Abstract

[Purpose/Significance] The resolution of the Third Plenary Session of the 20th Central Committee of the Communist Party of China explicitly called for improving the policy and governance systems that promote the development of strategic industries such as artificial intelligence (AI). In recent years, the conflict between AI companies' demand for copyrighted data and copyright holders' efforts to protect that data has become increasingly apparent, and lawsuits and disputes over AI-related copyright infringement have arisen around the world. The copyright protection of AI training data has become a bottleneck that urgently needs to be resolved in building a high-quality data system for AI. [Method/Process] Drawing on academic research and industry practice on copyright protection of AI data, this study systematically summarizes six representative approaches to the copyright dilemma of AI training data and compares their advantages, disadvantages, and applicability. The six approaches are: signing a license agreement between both parties, initiating special plans or forming alliances, introducing a copyright notice mechanism, introducing a copyright risk guarantee mechanism, replacing copyrighted data with synthetic data, and applying copyright detection tools to large language models. The comparison finds no optimal solution that can both encourage the supply of copyrighted training data for AI and protect the copyright of that data.
[Results/Conclusions] To inform efforts to increase the supply of copyrighted data for AI, formulate relevant policies, and advance related work, this study proposes a general implementation path for building a high-quality data system for AI that addresses the copyright dilemma of AI training data. The path is based on the comparative analysis of the six representative approaches and draws on China's four unique advantages. It comprises three measures: 1) Integrate existing platforms into a national-level integrated service platform for copyright data for AI, with state-owned enterprises (SOEs) under the direct administration of the central government taking the lead in establishing a national copyright data alliance and connecting copyright data to the platform. 2) Coordinate with local pilots on data intellectual property rights to explore and promote comprehensive reform pilot programs for copyright data adapted to the development of AI, and continuously strengthen cooperation, and the willingness to cooperate, between AI enterprises and copyright holders. 3) Focus on principled or critical issues, establish and improve legislation on copyright data for AI, and promote industry self-regulation.

Key words: artificial intelligence, data system for AI, copyright protection, copyright data, data elements

CLC Number: TP3-05

Hecan ZHANG, Chengqi YI, Peng GUO, Qianqian HUANG, Xiaokun JIN. Copyright Data Dilemma of Building High-Quality Data System for AI: Present Situation, Coping Strategies, and Implementation Path[J]. Journal of library and information science in agriculture, 2024, 36(9): 32-43.

Figures/Tables 3

Table1

Table2

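Analysis of six representative approaches to address the copyright data dilemma

Representative approach	Advantages	Disadvantages	Infringement risk	Applicable scenarios
Signing a license agreement between both parties	Most efficient way to obtain copyright data, lowest risk, widest applicability	High procurement and negotiation costs for copyright data; relatively low efficiency when acquiring individually held copyright data in bulk	None	AI companies with substantial financial reserves; research institutes, consulting firms, and similar organizations with high requirements for data quality, scale, and authoritativeness
Initiating special plans or forming alliances	Inherits some advantages of license agreements; eases, to some extent, the high procurement and negotiation costs of copyright data	No substantive progress or results so far; multi-party consensus is hard to reach; insufficient execution efficiency and flexibility, among others	Low	Initiated or joined by companies with notable industry influence, a strong voice, or unique copyright data resources
Introducing a copyright notice mechanism	No tracing or procurement costs for obtaining copyright authorization; wide applicability	Notices are easily overlooked; operation is technically demanding; large-scale "opt-out" of works would affect large-model performance, among others	Medium	AI companies with a certain reserve of compliant copyright data and technical capability
Introducing a copyright risk guarantee mechanism	Improves corporate reputation and public trust; reduces, to some extent, litigation between AI users and copyright owners	Liability for some infringements triggered during use shifts from users to the company itself; guarantee clauses often impose additional conditions on the covered situations	Medium	AI companies with a certain reserve of compliant copyright data, or with ample legal and financial resources
Replacing with synthetic data	Efficient, low-cost, and sustainable data production	Cannot fully eliminate copyright risks; makes detecting, tracing, and gathering evidence of infringement even harder	Low	Specific scenarios, such as those with relatively low requirements for data originality or relatively small requirements for copyright data scale
Applying copyright detection tools to large models	Eases the difficulty copyright owners face in detecting infringement, collecting evidence, and enforcing rights; improves the motivation and environment for creating copyrighted works; helps AI companies find unauthorized copyright data in advance	Few detection tools for large AI models are available and the technology is immature; raises companies' copyright data management costs	Not applicable	Credible third-party institutions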

Fig.1

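Date of agreement	AI company	Copyright owner/holder	Type of copyright data, agreement term, and amount
July 2024	Microsoft	Taylor & Francis	Journal and paper data; term not disclosed; USD 10 million
May 2024	OpenAI	News Corporation (US)	News data; 5-year term; over USD 250 million
April 2024	OpenAI	Financial Times (UK)	News data; term and amount not disclosed
February 2024	Google	Reddit	Social media data; term not disclosed; USD 60 million
January 2024	Wondershare Technology (万兴科技)	TVZone Media (中广天择)	Video data; term and amount not disclosed
December 2023	OpenAI	Axel Springer	News data; term and amount not disclosed
November 2023	Google	Canadian news publishers	News data; term not disclosed; CAD 100 million (approx. USD 73.6 million)
October 2023	Google	Corint Media (Germany)	News data; term not disclosed; EUR 3.2 million (approx. USD 3.38 million)
September 2023	Huawei Cloud	COL Group (中文在线)	Data including text, audio, and video; term and amount not disclosed
July 2023	OpenAI	Associated Press	News data; term and amount not disclosed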
