    The AI world's most valuable resource is running out, and it's scrambling to find an alternative: 'fake' data

    By Hasan Chowdhury and Hugh Langley

    2024-08-09
    AI leaders like ChatGPT's boss, Sam Altman, are hoping synthetic data will make their AI models smarter.
    • The AI industry has a major problem: The real-world data used to make smarter models is running out.
    • Companies scrambling for an alternative think synthetic data could offer a solution.
    • Research suggests synthetic data could poison AI with low-quality information.

    The AI world is on the cusp of running out of its most valuable resource — and that is pushing industry leaders into a fierce debate over a fast-growing alternative being touted as a replacement: synthetic data, or essentially "fake" data.

    For years, the likes of OpenAI and Google have scraped data from the internet to train the large language models that power their AI tools and features. These LLMs digested reams of text, video, and other media online produced by humans over centuries — be it research papers, novels, or YouTube clips.

    Now, the supply of "real," human-generated data is running dry. The research firm Epoch AI predicts textual data could run out by 2028. Meanwhile, companies that have mined every corner of the internet for usable training data — sometimes breaching platforms' policies to do so — face increased restrictions on what remains.

    To some, that's not necessarily a problem. OpenAI CEO Sam Altman has argued that AI models should eventually produce synthetic data good enough to train themselves effectively. The allure is obvious: Training data has become one of the most precious resources in the AI boom, and the prospect of generating it cheaply and seemingly infinitely is tantalizing.

    Still, researchers debate whether synthetic data is a magic bullet, with some arguing this path could lead AI models to poison themselves with poor-quality information and "collapse" as a result.

    A recent paper published by a group of Oxford and Cambridge researchers found that feeding a model AI-generated data eventually led it to produce gibberish. The authors concluded that AI-generated data isn't entirely unusable for training but should be balanced with real-world data.

    As the well of usable human-generated data dries up, more companies are looking into synthetic data. In 2021, the research firm Gartner predicted that by 2024, 60% of the data used for developing AI would be synthetically generated.

    "It's a crisis," said Gary Marcus, an AI analyst and professor emeritus of psychology and neural science at New York University. "People had the illusion that you could infinitely make large language models better by just using more and more data, but now they've basically used all the data they can."

    He added: "Yes, it will help you with some problems, but the deeper problem is that these systems don't really reason; they don't really plan. All the synthetic data you can imagine is not going to solve that foundational problem."

    More companies create synthetic data

    The need for "fake" data hinges on the notion that real-world data is quickly running out.

    This is partly because tech firms have been moving as fast as possible to use publicly available data to train AI in an effort to outsmart rivals. It's also because online data owners have become increasingly wary of companies taking their data for free.

    OpenAI researchers revealed in 2020 how they used free data from Common Crawl, an open repository of web-crawl data that the company said contained "nearly a trillion words" from online sources, to train the AI model that would eventually power ChatGPT.

    Research published in July by the Data Provenance Initiative found websites were putting restrictions in place to stop AI firms from using data that didn't belong to them. News publications and other top sites are increasingly blocking AI companies from freely cribbing their data.

    To get around this problem, companies such as OpenAI and Google are cutting checks for tens of millions of dollars for access to data from Reddit and news outlets, which act as conveyor belts of fresh data for training models. Even this has its limitations.

    "There are no longer major areas of the textual web just waiting to be grabbed," Nathan Lambert, a researcher at the Allen Institute for AI, wrote in May.

    This is where synthetic data comes in. Rather than being pulled from the real world, synthetic data is generated by AI systems that have been trained on real-world data.
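    To make the mechanism concrete, here is a minimal sketch in Python using Hugging Face's transformers library — an illustration of the general idea only, not any company's actual pipeline. A model already trained on real web text (GPT-2 here, as an arbitrary choice) is simply sampled to produce new "fake" examples.

```python
# Minimal sketch: sample a model trained on real-world text to produce
# synthetic training examples. GPT-2 and the prompts are illustrative
# choices, not a production setup.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampling reproducible
generator = pipeline("text-generation", model="gpt2")

# Seed prompts drawn from real text steer what the synthetic data covers.
prompts = ["The scientists discovered", "In today's market,"]

synthetic_examples = []
for prompt in prompts:
    outputs = generator(
        prompt,
        max_new_tokens=40,
        do_sample=True,
        num_return_sequences=3,
    )
    synthetic_examples.extend(o["generated_text"] for o in outputs)

# These strings could then be filtered and folded into a training corpus.
print(f"generated {len(synthetic_examples)} synthetic examples")
```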

    In June, for instance, Nvidia released an AI model that can create artificial datasets for training and alignment. In July, researchers at the Chinese tech giant Tencent created a synthetic-data generator called Persona Hub, which does a similar job.

    Some startups, such as Gretel and SynthLabs, are even popping up with the sole purpose of generating and selling troves of specific types of data to companies that need it.

    A chat powered by Meta's Llama 3 AI model.

    Proponents of synthetic data offer fair reasons for its use. Like the real world it comes from, human-generated data is often messy, leaving researchers with the complex and laborious task of cleaning and labeling it before it can be used.

    Synthetic data may also fill holes that human data cannot. In late July, Meta introduced Llama 3.1, a new series of AI models that generate synthetic data and rely on it for "fine-tuning" in training. In particular, Meta used the data to improve performance on specific skills, such as coding in languages like Python, Java, and Rust, as well as solving math problems.

    Synthetic training could be particularly effective for smaller AI models. Microsoft said last year that it gave OpenAI's models a diverse list of words a typical 3- to 4-year-old would know and then asked the models to generate short stories using those words. The resulting dataset was used to create a group of small but capable language models.
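    A hedged sketch of that recipe — the word list, prompt, and model name below are illustrative stand-ins, not Microsoft's actual setup — might look like this:

```python
# Illustrative sketch of the approach Microsoft described: sample a few
# words a young child would know, then ask a large model to write a
# short story using them.
import random
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

# Stand-in for the curated child vocabulary Microsoft described.
CHILD_VOCAB = ["dog", "ball", "happy", "run", "tree", "little", "jump", "sun"]

def generate_story() -> str:
    words = random.sample(CHILD_VOCAB, 3)
    prompt = (
        "Write a three-sentence story a 3- to 4-year-old could understand, "
        f"using the words: {', '.join(words)}."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice, not Microsoft's
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Looping over many word combinations yields a simple synthetic corpus
# suitable for training a much smaller model.
```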

    Synthetic data may also offer effective "countertuning" against the biases produced by real-world data. In their 2021 paper, "On the Dangers of Stochastic Parrots," the former Google researchers Timnit Gebru, Margaret Mitchell, and others said that LLMs trained on massive datasets of text from the internet would likely reflect the biases in that data.

    In April, a group of Google DeepMind researchers published a paper championing the use of synthetic data to address data scarcity and privacy concerns in training, while cautioning that ensuring the accuracy and impartiality of this AI-generated data "remains a critical challenge."

    'Habsburg AI'

    While the AI industry has found some advantages in synthetic data, it also faces serious issues it can't afford to ignore, chief among them the fear that synthetic data can wreck AI models.

    In Meta's research paper on Llama 3.1, the company said that training the 405 billion-parameter version of the latest model "on its own generated data is not helpful" and may even "degrade performance."

    A study published in the journal Nature last month found that "indiscriminate use" of synthetic data in model training could cause "irreversible defects." The researchers called this phenomenon "model collapse" and said that the problem must be taken seriously "if we are to sustain the benefits of training from large-scale data scraped from the web."
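    The study's simplest demonstration is easy to reproduce in miniature: fit a distribution to some data, sample from the fit, refit on the samples, and repeat. In this toy sketch — our illustration in the spirit of the paper, not the study's code — the estimated spread collapses toward zero over generations:

```python
# Toy "model collapse": a Gaussian repeatedly refit to its own samples
# gradually loses the variance (and the tails) of the original real data.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, size=50)  # generation 0: "real" data

for gen in range(1, 201):
    mu, sigma = samples.mean(), samples.std()  # fit a simple "model"
    samples = rng.normal(mu, sigma, size=50)   # train only on its output
    if gen % 50 == 0:
        print(f"generation {gen}: estimated sigma = {sigma:.3f}")
```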

    Jathan Sadowski, a senior research fellow at Monash University, coined a term for this idea: Habsburg AI, in reference to the Austrian dynasty that some historians believe destroyed itself through inbreeding. Since coining the term, Sadowski told Business Insider he has felt validated by the research backing his assertion that models heavily trained on AI outputs could become mutated.

    "The open question for researchers and companies building AI systems is, how much synthetic data is too much?" Sadowski said. "They need to find any possible solution to overcome the challenges of data scarcity for AI systems," he added, noting that some of the solutions may turn out to be short-term fixes that do more harm than good.

    However, research published in April found that models trained on their own generated data didn't necessarily need to "collapse" if they were trained with both "real" and synthetic data. Now, some companies are betting on a future of "hybrid data," where synthetic data is generated by using some real data in an effort to stop the model from going off-piste.
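    What that mixing might look like in practice is something like the sketch below, where each training batch caps the share of synthetic examples. The 30% ceiling is an arbitrary assumption for illustration, not any company's published recipe.

```python
# Hypothetical "hybrid data" batching: anchor every training batch to
# real examples and cap the synthetic share at an assumed 30%.
import random

def hybrid_batch(real, synthetic, batch_size=32, max_synth_frac=0.3):
    n_synth = int(batch_size * max_synth_frac)
    n_real = batch_size - n_synth
    batch = random.sample(real, n_real) + random.sample(synthetic, n_synth)
    random.shuffle(batch)  # avoid ordering effects during training
    return batch
```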

    Scale AI, which helps companies label and test data, said it was exploring "the direction of hybrid data," using both synthetic and nonsynthetic data. (Scale AI's CEO, Alexandr Wang, recently said: "Hybrid data is the real future.")

    In search of other solutions

    AI may require new approaches, as simply jamming more data into models may only go so far.

    A group of Google DeepMind researchers may have proved the merits of another approach in January, when the lab announced AlphaGeometry, an AI system that can solve geometry problems at an Olympiad level.

    In a supplemental paper, the researchers said AlphaGeometry used a "neuro-symbolic" approach, which meshes the strengths of different AI paradigms, landing somewhere between data-hungry deep-learning models and rule-based logical reasoning. IBM's research group has described neuro-symbolic AI as "a pathway to achieve artificial general intelligence."

    What's more, in the case of AlphaGeometry, it was pretrained on entirely synthetic data.

    The neuro-symbolic field of AI is relatively young, and it remains to be seen whether it will propel AI forward.

    Given the pressures companies such as OpenAI, Google, and Microsoft face in turning AI hype into profits, expect them to try every solution possible to solve the data crisis.

    "We're still basically going to be stuck here unless we take new approaches altogether," Marcus said.

    Read the original article on Business Insider