
    This Apple AI study suggests ChatGPT and other chatbots can’t actually reason

    By Chris Smith

    1 day ago

    Companies like OpenAI and Google will tell you that the next big step in generative AI experiences is almost here. ChatGPT’s o1-preview upgrade is meant to showcase that next-gen experience. o1-preview, available to ChatGPT Plus and other premium subscribers, can supposedly reason. Such an AI tool should be more useful for complex questions that require multi-step reasoning.

    But if a new AI paper from Apple researchers is correct in its conclusions, then ChatGPT o1 and all other genAI models can’t actually reason. Instead, they’re simply matching patterns from their training data sets. They’re pretty good at coming up with solutions and answers, yes. But that’s only because they’ve seen similar problems and can predict the answer.

    Apple’s AI study shows that making trivial changes to math problems, the kind that wouldn’t fool a grade-schooler, or adding text that doesn’t alter how you’d solve them, can significantly hurt the reasoning performance of large language models.


    Apple’s study, available as a pre-print at this link, details the types of experiments the researchers ran to see how the reasoning performance of various LLMs varies. They looked at open-source models like Llama, Phi, Gemma, and Mistral, as well as proprietary ones like ChatGPT o1-preview, o1-mini, and GPT-4o.


    The conclusions are consistent across tests: LLMs can’t really reason. Instead, they try to replicate the reasoning steps they might have seen during training.

    The scientists developed a variation of the GSM8K benchmark, a set of over 8,000 grade-school math word problems that AI models are tested on. Called GSM-Symbolic, the new benchmark makes simple changes to the math problems, such as modifying the characters’ names, relationships, and numbers.


    An example shared on X illustrates this: “Sophie” is the main character in a problem about counting toys. Replacing the name with something else and changing the numbers should not alter the performance of a reasoning AI model like ChatGPT. After all, a grade-schooler could still solve the problem after those details change.
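    For a concrete idea of how such variants could be generated, here is a minimal Python sketch of this kind of template substitution. The template text, names, and number ranges are invented for illustration; they are not taken from Apple’s GSM-Symbolic tooling, which the article doesn’t detail.

```python
import random

# Hypothetical GSM-Symbolic-style template. The placeholder names and number
# ranges are made up; only the substitution idea mirrors the study.
TEMPLATE = (
    "{name} has {start} toys. For a friend's birthday, {name} buys {extra} more toys. "
    "How many toys does {name} have now?"
)
NAMES = ["Sophie", "Liam", "Ava", "Noah"]

def make_variant(seed: int) -> tuple[str, int]:
    """Produce one surface-level variant of the same underlying problem."""
    rng = random.Random(seed)
    name = rng.choice(NAMES)
    start, extra = rng.randint(10, 60), rng.randint(5, 40)
    question = TEMPLATE.format(name=name, start=start, extra=extra)
    answer = start + extra  # the reasoning required to solve it never changes
    return question, answer

if __name__ == "__main__":
    for i in range(3):
        question, answer = make_variant(i)
        print(question, "->", answer)
```

    A human solver’s accuracy wouldn’t budge across variants like these; the study’s point is that LLM accuracy does.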

    The Apple scientists showed that average accuracy dropped by up to 10% across all models on the GSM-Symbolic test. Some models fared better than others: GPT-4o, for example, only slipped from 95.2% accuracy on GSM8K to 94.9% on GSM-Symbolic.

    That’s not the only test that Apple performed. They also gave the AIs math problems that included statements that were not really relevant to solving the problem.

    Here’s the original problem that the AIs would have to solve:

    Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. How many kiwis does Oliver have?

    Here’s a version of it that contains an inconsequential statement that some kiwis are smaller than others:

    Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picked double the number of kiwis he did on Friday, but five of them were a bit smaller than average. How many kiwis does Oliver have?

    The result should be identical in both cases (44 + 58 + 2 × 44 = 190 kiwis), but the LLMs subtracted the smaller kiwis from the total. Apparently, you don’t count the smaller fruit if you’re an AI with reasoning abilities.

    Adding these “seemingly relevant but ultimately inconsequential statements” to GSM-Symbolic templates leads to “catastrophic performance drops” for the LLMs. Performance for some models dropped by 65%. Even o1-preview struggled, showing a 17.5% performance drop compared to GSM8K.
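    For illustration, here is a similarly hedged Python sketch of this second manipulation: appending a clause that changes the wording but not the math. The helper function and the way the clause is spliced in are my own assumptions, not Apple’s code.

```python
# Sketch of the "inconsequential statement" manipulation described above.
# The helper and the splicing logic are illustrative, not from Apple's tooling.
BASE = (
    "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. "
    "On Sunday, he picks double the number of kiwis he did on Friday. "
    "How many kiwis does Oliver have?"
)
DISTRACTOR = "but five of them were a bit smaller than average"

def add_noop_clause(problem: str, clause: str) -> str:
    """Attach a clause that changes the wording but not the answer."""
    statement, question = problem.rsplit(". ", 1)
    return f"{statement}, {clause}. {question}"

correct_answer = 44 + 58 + 2 * 44  # 190 kiwis in both versions; size is irrelevant
print(add_noop_clause(BASE, DISTRACTOR))
print("Expected answer:", correct_answer)
```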

    Interestingly, I tested the same problem with o1-preview, and ChatGPT was able to reason that all of the kiwis count toward the total regardless of their size.

    ChatGPT o1-preview solved the kiwi problem. Image source: Chris Smith, BGR

    Apple researcher Mehrdad Farajtabar has a thread on X that covers the kinds of changes Apple made for the new GSM-Symbolic benchmarks, with additional examples and the resulting changes in accuracy. You’ll find the full study at this link.

    Apple isn’t going after rivals here; it’s simply trying to determine whether current genAI tech allows these LLMs to reason. Notably, Apple isn’t ready to offer a ChatGPT alternative that can reason.

    That said, it’ll be interesting to see how OpenAI, Google, Meta, and others challenge Apple’s findings in the future. Perhaps they’ll devise other ways to benchmark their AIs and prove they can reason. If anything, Apple’s data might be used to alter how LLMs are trained to reason, especially in fields requiring accuracy.


