Silicon Valley Giants Exploit 170,000+ YouTube Videos for AI Training Without Creators' Consent

17 July 2024

Zaker Adham

Summary

A joint investigation by Proof News and Wired has found that several major tech companies, including NVIDIA, Apple, Salesforce, and Anthropic, used content from a large number of YouTube videos to train their AI models, despite YouTube's strict rules against unauthorized content harvesting.

The investigation revealed that these companies used a dataset known as YouTube Subtitles, which contains subtitles extracted from 173,536 YouTube videos across more than 48,000 channels. The videos include educational content from Khan Academy, MIT, and Harvard, as well as material from media giants like The Wall Street Journal and the BBC, and popular creators such as MrBeast, Marques Brownlee, Jacksepticeye, and PewDiePie.

The data was then used to train the companies' generative AI models, raising significant ethical questions about the methods these corporations use to maintain a competitive edge in the rapidly evolving AI sector.

Dave Wiskus, CEO of Nebula, condemned these practices, stating, "It's theft. Using creators' work without their consent is deeply disrespectful, especially when companies aim to use generative AI to potentially replace artists."

David Pakman of "The David Pakman Show" expressed similar concerns: "No one asked for my permission. This is my livelihood, and I invest significant time and resources into my content. It's a blatant disregard for the creators' hard work."

The report indicates that EleutherAI, the organization behind the YouTube Subtitles dataset, did not respond to inquiries about the findings or the legality of their methods. The dataset is part of a larger collection called The Pile, which includes diverse sources such as European Parliament proceedings, English Wikipedia, and Enron Corporation emails.

Research papers published by these tech companies openly describe using The Pile to train their AI models. Apple, for instance, used The Pile to train OpenELM, a significant AI model launched in April, shortly before the company unveiled new AI features for its products. Salesforce also confirmed using The Pile to develop AI models intended for academic and research purposes.