ITPro

What is WRAP and how can it help train AI more efficiently?

By George Fitzmaurice

15 days ago

Generative AI is booming, though developers are quickly running into obstacles, from the high energy demands of AI compute to the complex infrastructure required to train systems.

For the latter, data is of the utmost importance. Stockpiles of clear, quality data are vital for those companies looking to train and build their own AI models. Getting data pools in order is a key part of the early development process.

One novel approach to making this process easier is web rephrase augmented pre-training (WRAP), a technique put forward by researchers at Apple and Carnegie Mellon University in a paper published earlier this year.

In it, the researchers noted that many large language models (LLMs) are trained on data scraped from the web that is often “unstructured, noisy, and poorly phrased,” making it harder to use for training.

Synthetic data can be used to get around this problem, but it can fall victim to bias. The alternative practice of curating data to remove lower-quality examples can also be effective, but the researchers put forward their own solution.

Rather than generating synthetic data from scratch, WRAP uses an “off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as ‘like Wikipedia’ or in ‘question-answer format’ to jointly pre-train LLMs on real and synthetic rephrases.”
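To make the idea concrete, here is a minimal sketch of how style-specific rephrasing prompts might be assembled. The prompt wording, the `STYLE_PROMPTS` table, and the function names are illustrative assumptions rather than anything taken from the paper, and `generate` stands in for whatever instruction-tuned model is being called.

```python
from typing import Callable

# Hypothetical prompt templates: the wording is illustrative only and is
# not quoted from the WRAP paper.
STYLE_PROMPTS = {
    "wikipedia": (
        "Paraphrase the following text in clear, high-quality English "
        "like that found on Wikipedia:"
    ),
    "qa": (
        "Convert the following text into a question-answer format with "
        "multiple 'Question:' and 'Answer:' pairs:"
    ),
}


def build_rephrase_prompt(document: str, style: str) -> str:
    """Combine a style instruction with the raw web document."""
    if style not in STYLE_PROMPTS:
        raise ValueError(f"unknown style: {style!r}")
    return f"{STYLE_PROMPTS[style]}\n\n{document.strip()}"


def rephrase(document: str, style: str, generate: Callable[[str], str]) -> str:
    """Send the prompt to a model. `generate` is any text-in, text-out callable."""
    return generate(build_rephrase_prompt(document, style))
```

In practice `generate` would wrap a call to an instruction-tuned model; passing it in as a plain callable keeps the sketch independent of any particular model or API.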

According to the paper, WRAP sped up pre-training by about three times when used on a “naturally noisy” dataset.

How does WRAP work?

In the paper, the researchers apply the rephrasing process to ‘The Pile,’ a collection of datasets commonly used in AI, according to Dr Andrew Bolster, senior research and development manager of data science at Synopsys.

“Some datasets are used as benchmarks to compare architectures and scales. One such collection of datasets is known as ‘The Pile,’” Bolster tells ITPro.

This is an 825GB collection of web-scraped data, Bolster goes on, featuring content from a range of sites including PubMed, GitHub, Stack Exchange, Hacker News, and YouTube subtitles.

The researchers augment ‘The Pile’ by rephrasing large portions of it before combining these rephrased portions with the original dataset to train an LLM which answers questions, Bolster says. This LLM is then evaluated for its zero-shot accuracy – meaning its capacity to answer questions not rooted in its training data.
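The combining step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' pipeline: the function name `mix_corpus`, the `synthetic_per_real` parameter, and its default of one rephrase per original document are all hypothetical choices made for the example.

```python
import random
from typing import List, Sequence, Tuple


def mix_corpus(
    originals: Sequence[str],
    rephrases: Sequence[str],
    synthetic_per_real: float = 1.0,  # assumed ratio, chosen for illustration
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Return a shuffled list of (source, text) pairs for training.

    All original documents are kept. Rephrases are sampled so that there
    are roughly `synthetic_per_real` of them per original, capped at the
    number of rephrases available.
    """
    if synthetic_per_real < 0:
        raise ValueError("synthetic_per_real must be non-negative")
    rng = random.Random(seed)
    wanted = round(len(originals) * synthetic_per_real)
    sampled = rng.sample(list(rephrases), min(len(rephrases), wanted))
    mixed = [("real", text) for text in originals]
    mixed += [("synthetic", text) for text in sampled]
    rng.shuffle(mixed)
    return mixed
```

Tagging each document with its source makes it easy to check the real-to-synthetic balance of the final training set, and the fixed seed keeps the mix reproducible between runs.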

“This is a form of ‘synthetic data augmentation’, where in any modeling system, there may not be sufficient ‘real’ input/training data to accurately converge a model’s behavior,” Bolster says.

“Data Scientists may simply ‘repeat’ the training data over and over again, hoping for the best, but over the past decade, this has largely been replaced with synthetic data generation involving the training of intermediate models to generate more data that ‘looks like’ the provided training set,” he adds.

The LLM trained on the rephrased data within the paper reportedly “outperforms other techniques” that fulfill the same need. There is a “small but clear improvement,” Bolster says.

WRAP appears to beat other natural language augmentation techniques such as synonym replacement and random word deletion, though Bolster is careful to point out that there are some evident downsides.
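For comparison, the two baseline techniques named here are simple enough to sketch directly. The version below is a toy illustration: the synonym table is a hypothetical stand-in for a real thesaurus, and the rule of keeping at least one word after deletion is a convention assumed for the example.

```python
import random
from typing import Dict, List

# Hypothetical toy synonym table; a real system would use a thesaurus.
SYNONYMS: Dict[str, List[str]] = {
    "quick": ["fast", "rapid"],
    "big": ["large"],
}


def synonym_replace(text: str, synonyms: Dict[str, List[str]], rng: random.Random) -> str:
    """Swap every word that has an entry in `synonyms` for a random synonym."""
    words = text.split()
    replaced = [
        rng.choice(synonyms[word.lower()]) if word.lower() in synonyms else word
        for word in words
    ]
    return " ".join(replaced)


def random_delete(text: str, p: float, rng: random.Random) -> str:
    """Drop each word independently with probability `p`.

    If every word would be dropped, keep one at random so the output
    is never empty for non-empty input.
    """
    words = text.split()
    kept = [word for word in words if rng.random() >= p]
    if not kept and words:
        kept = [rng.choice(words)]
    return " ".join(kept)
```

Both functions only shuffle or remove surface tokens, whereas rephrasing with a language model rewrites whole passages, which is the distinction being drawn between WRAP and these older techniques.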

The pros and the cons of WRAP

For any businesses or developers looking to WRAP as a method of driving efficiency during AI model training, there are some key advantages and disadvantages to consider, particularly regarding proprietary data.

WRAP could work best for in-house AI training, says Stefan Leichenauer, vice president of engineering at SandboxAQ, but there may be “limited effectiveness out-of-the-box” when the method is applied to proprietary data.

“In-house training data may not be as naturally messy as what we find on the public internet,” Leichenauer tells ITPro. To fix this, he suggests businesses should convert their data into “something that is more in line with the end application we are interested in.”

“So, for example, if you are interested in training a customer service chatbot, then you should try transforming your internal documentation into a question-answer format before training the AI,” he says.

Messy data has always been a problem, Leichenauer adds, noting that WRAP is one of many tools developers can use to perform “initial data transformations” for more effective training.

On the other hand, Bolster notes that WRAP could cut the up-front costs of creating enterprise-grade LLMs, which generally go toward the “establishment of curated / domain specific data”. In this sense, WRAP could have its advantages.

“This rephrase capability might be a valid method for making the most of limited data to train against,” Bolster says.

Having said that, WRAP has a “significant upfront cost” of its own – the generation of synthetic data. At present, there are also sensitivities to rephrasing “styles” and model selection which could cause issues.

It’s clear that as WRAP matures, it could have a big impact on the way that companies pursue their own LLMs and refine their data to ensure the best ROI on their AI investments. Whether it will become as pivotal a development as the likes of retrieval-augmented generation (RAG) has yet to be seen, but businesses will be keenly investigating its benefits and drawbacks.
