TechRadar

Investigation finds companies are training AI models with YouTube content without permission

By Eric Hal Schwartz

6 hours ago

Artificial intelligence models need as much useful data as possible to perform well, and some of the biggest AI developers are relying in part on transcribed YouTube videos to get it. They are doing so without permission from the creators and in violation of YouTube's own rules, according to an investigation by Proof News and Wired.

The two outlets revealed that Apple, Nvidia, Anthropic, and other major AI firms have trained their models with a dataset called YouTube Subtitles incorporating transcripts from nearly 175,000 videos across 48,000 channels, all without the video creators knowing.

The YouTube Subtitles dataset comprises the text of video subtitles, often with translations into multiple languages. The dataset was built by EleutherAI, which described the dataset's goal as lowering barriers to AI development for those outside big tech companies. It's only one component of the much larger EleutherAI dataset called the Pile. Along with the YouTube transcripts, the Pile has Wikipedia articles, speeches from the European Parliament, and, according to the report, even emails from Enron.

The Pile nonetheless has a lot of fans among the major tech companies. For instance, Apple used the Pile to train its OpenELM AI model, while a Salesforce AI model released two years ago was trained on the Pile and has since been downloaded more than 86,000 times.

The YouTube Subtitles dataset encompasses a range of popular channels across news, education, and entertainment. That includes content from major YouTube stars like MrBeast and Marques Brownlee. All of them have had their videos used to train AI models. Proof News built a search tool that checks whether a particular video or channel is in the collection. There are even a few TechRadar videos in the collection, as seen below.

https://img.particlenews.com/image.php?url=0q28W5_0uTWlmyn00 — (Image credit: Proof News)

The YouTube Subtitles dataset seems to contradict YouTube’s terms of service, which explicitly forbid automated scraping of its videos and associated data. That’s exactly what the dataset relied on, however, with a script downloading subtitles through YouTube’s API. The investigation reported that the automated download selected videos using nearly 500 search terms.

The discovery provoked a lot of surprise and anger from the YouTube creators Proof and Wired interviewed. The concerns about the unauthorized use of content are valid, and some of the creators were upset at the idea their work would be used without payment or permission in AI models. That’s especially true for those who found out the dataset includes transcripts of deleted videos, and in one case, the data comes from a creator who has since removed their entire online presence.

The report didn’t have any comment from EleutherAI. It did point out that the organization describes its mission as democratizing access to AI technologies by releasing trained models. That may conflict with the interests of content creators and platforms, if this dataset is anything to go by. Legal and regulatory battles over AI were already complex. This kind of revelation will likely make the ethical and legal landscape of AI development more treacherous. It’s easy to suggest a balance between innovation and ethical responsibility for AI, but producing it will be a lot harder.


