TechRadar

Big Tech’s unspoken rule: using online content and copyrighted material to train AI is seemingly the norm - and it doesn’t look like that’s about to change

By Kristina Terech

1 day ago

This week, we learned that huge tech corporations, such as Apple, Nvidia, and Anthropic, allegedly use information like the subtitles and transcripts of YouTube videos to train their AI models.

Some of the creators of these videos reacted to the news that their content was used in this way with disappointment and frustration, and understandably so. While they agreed to YouTube’s terms of service, which may include implicit agreement that content could be used in ways like this, they put a ton of work into their videos, and that’s gone on to be used and maybe even sold without the original creators seeing compensation or even credit.

Unfortunately, I don’t think this will be an isolated incident - instead, it strikes me as a demonstration of an unspoken rule among tech companies that are developing AI models. As a supervisor working in this area at Amazon allegedly told an ex-employee when instructing her to ignore potential copyright-related issues, “everyone’s doing it.”

https://img.particlenews.com/image.php?url=2cHEvR_0uXm9NwF00 — (Image credit: Shutterstock/Ground Picture)

A more critical look at training data

Ironically, a few months ago, I sang Apple’s praises because it seemed like the company was keeping ethical considerations of this kind at the core of its AI software development. I was particularly impressed by Apple’s approach considering how rival AI models, particularly large language models (LLMs), are being trained using material from people who may not have consented to their work being used in that way.

In short, an important aspect of developing LLMs is feeding them vast amounts of information (called training data) that they “learn” from, improving their ability to produce coherent and convincing human-like responses. It helps to put human speech (and writing) in to get human-like speech out. To get better-quality responses that emulate well-written, informed, and possibly more interesting writing, LLM developers input written materials such as books, website content, and social media posts - much of which is protected by copyright.

https://img.particlenews.com/image.php?url=4LtGJo_0uXm9NwF00 — (Image credit: Shutterstock/Motion Box)

Navigating the ethical and legal complexities of training data

In my article about Apple’s seemingly ethical approach, I went into some detail about the lawsuits mounted by the New York Times and a number of prominent authors against companies like Microsoft, OpenAI, Meta, Alphabet (parent company of Google), and others regarding possible copyright infringement.

Critics of this practice say that it could be considered copyright infringement if these tech companies haven’t gotten the explicit consent of the respective copyright holders or their legal representatives. However, these misgivings do not discourage industry leaders in consumer AI products, such as OpenAI (the company behind ChatGPT). A spokesperson for the company wrote the following about the issue as part of evidence that was submitted to the UK’s House of Lords communications and digital committee, as reported by the Telegraph:

“Because copyright today covers virtually every sort of human expression – including blog posts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today’s leading AI models without using copyrighted materials.”

The spokesperson for OpenAI went on to state that the company complies with all copyright laws when using copyrighted material in the training of its AI models and that it believes “that legally copyright law does not forbid training.”

The report about the use of YouTube video material comes from Wired and Proof News, which allege that using this material without creators’ permission violates YouTube’s rules. This material is part of a data set named the Pile, which was built by EleutherAI, a nonprofit research lab that claims to want to lower the barriers to AI development.

Apple has stepped forward to clarify that it used Pile data to train its research models, including OpenELM, for the end goal of learning about LLMs and not to train Apple Intelligence (Apple’s AI that’s developed specifically for use on Apple products).

This means that if YouTube’s rules were broken, they were broken by EleutherAI, and EleutherAI would face any related litigation. I don’t know if that totally absolves the tech firms that use the ripped YouTube data, but it demonstrates how complex the ethical and legal ramifications of this practice can and will become very quickly – and this is just one particular instance.

https://img.particlenews.com/image.php?url=0LQyd8_0uXm9NwF00 — (Image credit: Shutterstock/Tada Images)

As AI evolves rapidly, will the ethics and laws evolve with it?

“If you are not paying for it, you’re not the customer; you’re the product being sold.”

This sentiment has been around since the 1970s, but the above version was left as a comment on an article discussing the news aggregator website Digg in 2010, and it has been repeated (or at least paraphrased) often when speaking about many digital and internet products since. It is also a common sentiment in the Reddit thread about the article written by Wired and Proof News.

I’m not saying I agree with it, and, personally, I fall on the side of people who feel that it is copyright infringement, but companies (not just tech companies) love new technology that lets them pay less for human labor while continuing to increase output and revenue. Furthermore, many governments and regulatory bodies are slow on the uptake when it comes to enacting new regulations and legal frameworks that emerging technologies can exist within.

So, we can feel as negatively about it as we like, but I don’t think that’ll stop tech companies from continuing this practice. Frankly, I think they hope their products become so entrenched in our lives that even if ethical or legal considerations catch up with them, we’ll want to continue using them anyway.

I know I sound cynical - and I also don’t have a functional crystal ball. Maybe the sentiment will turn; maybe AI technology will bring so much good into the world that it outweighs the negatives. Maybe, maybe, maybe… We’ll have to continue watching how AI evolves. What I can say with some confidence is that AI's presence will become increasingly significant in our lives, and there will likely be unintended consequences – both positive and negative. Because of this, there will come a time when we’ll have to really understand and address these consequences thoughtfully and proactively, but I don’t think we’ve reached that point yet.
