    GPT-4 didn't ace the bar exam after all, MIT research suggests — it didn't even break the 70th percentile

    By Ben Turner

    May 31, 2024

    GPT-4 didn't actually score in the top 10% on the bar exam after all, new research suggests.

    OpenAI, the company behind the large language model (LLM) that powers its chatbot ChatGPT, made the claim in March 2023, and the announcement sent shock waves around the web and the legal profession.

    Now, a new study has revealed that the much-hyped 90th-percentile figure was actually skewed toward repeat test takers who had already failed the exam one or more times, a much lower-scoring group than test takers overall. The researcher published his findings March 30 in the journal Artificial Intelligence and Law.

    "It seems the most accurate comparison would be against first-time test takers or to the extent that you think that the percentile should reflect GPT-4's performance as compared to an actual lawyer; then the most accurate comparison would be to those who pass the exam," study author Eric Martínez , a doctoral student at MIT's Department of Brain and Cognitive Sciences, said at a New York State Bar Association continuing legal education course .

    Related: AI can 'fake' empathy but also encourage Nazism, disturbing study suggests

    To arrive at its claim, OpenAI used a 2023 study in which researchers made GPT-4 answer questions from the Uniform Bar Examination (UBE). The AI model's results were impressive: It scored 298 out of 400, which placed it in the top tenth of exam takers.

    But it turns out the artificial intelligence (AI) model only scored in the top 10% when compared with repeat test takers. When Martínez contrasted the model's performance more generally, the LLM scored in the 69th percentile of all test takers and in the 48th percentile of those taking the test for the first time.
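
    To see why the choice of comparison group swings the percentile so dramatically, consider a toy illustration in Python. The score lists below are invented purely for illustration; they are not real bar-exam data, and the percentile values they produce are not the study's figures.

        # Hypothetical, invented score lists -- NOT real bar-exam data.
        repeat_takers = [220, 240, 250, 260, 265, 270, 275, 280, 285, 290]
        all_takers = [250, 270, 280, 290, 295, 300, 305, 310, 315, 320]

        def percentile_of(score, group):
            """Share of the group scoring at or below `score`, as a percentage."""
            return 100 * sum(s <= score for s in group) / len(group)

        # The same raw score of 298 looks very different against each group:
        print(percentile_of(298, repeat_takers))  # 100.0 -- top of the pack
        print(percentile_of(298, all_takers))     # 50.0  -- middle of the pack

    The same mechanism is at work in Martínez's reanalysis: GPT-4's raw score never changed, only the population it was ranked against.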

    Martínez's study also suggested that the model's results ranged from mediocre to below average in the essay-writing section of the test. It landed in the 48th percentile of all test takers and in the 15th percentile of those taking the test for the first time.

    To investigate further, Martínez had GPT-4 take the exam again under the parameters set by the authors of the original study. The UBE typically consists of three components: the multiple-choice Multistate Bar Examination (MBE); the Multistate Performance Test (MPT), which requires examinees to perform various lawyering tasks; and the written Multistate Essay Examination (MEE).

    Martínez was able to replicate GPT-4's score on the multiple-choice MBE but spotted "several methodological issues" in the grading of the MPT and MEE portions of the exam. He noted that the original study did not use the essay-grading guidelines set by the National Conference of Bar Examiners, which administers the bar exam. Instead, the researchers simply compared GPT-4's answers with "good answers" from the state of Maryland.

    This is significant: Martínez said the essay-writing section is the bar exam's closest proxy to the tasks a practicing lawyer performs, and it was the section in which the AI performed worst.

    "Although the leap from GPT-3.5 was undoubtedly impressive and very much worthy of attention, the fact that GPT-4 particularly struggled on essay writing compared to practicing lawyers indicates that large language models, at least on their own, struggle on tasks that more closely resemble what a lawyer does on a daily basis," Martínez said.

    The minimum passing score varies from state to state, from 260 to 272, so GPT-4's essay score would have to be disastrous for it to fail the overall exam. But a drop of just nine points in its essay score would drag its score to the bottom quarter of MBE takers and beneath the fifth percentile of licensed attorneys, according to the study.
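
    For a rough sense of the passing margin involved, here is a minimal back-of-the-envelope sketch in Python, using only the figures reported in this article (the variable names are illustrative, not taken from the study):

        # Figures as reported in this article, not from the study's data files.
        gpt4_ube_score = 298       # GPT-4's reported UBE score, out of 400
        strictest_pass_mark = 272  # top of the 260-272 range of state passing scores
        essay_drop = 9             # the nine-point essay drop the study discusses

        adjusted = gpt4_ube_score - essay_drop
        # 289 still clears even the strictest state's pass mark, which is why only
        # a disastrous essay score would cause GPT-4 to fail the exam outright.
        print(adjusted, adjusted >= strictest_pass_mark)  # 289 True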

    RELATED STORIES

    Scientists create 'toxic AI' that is rewarded for thinking up the worst possible questions we could imagine

    Claude 3 Opus has stunned AI researchers with its intellect and 'self-awareness' — does this mean it can think for itself?

    Researchers gave AI an 'inner monologue,' and it massively improved its performance

    Martínez said his findings revealed that, while undoubtedly still impressive, current AI systems should be carefully evaluated before they are relied on in legal settings, lest they be used "in an unintentionally harmful or catastrophic manner."

    The warning appears to be timely. Despite their tendency to produce hallucinations (fabricating facts or connections that don't exist), AI systems are being considered for multiple applications in the legal world. For example, on May 29, a federal appeals court judge suggested that AI programs could help interpret the contents of legal texts.

    In response to an email about the study's findings, an OpenAI spokesperson referred Live Science to "Appendix A on page 24" of the GPT-4 technical report. The relevant line there reads: "The Uniform Bar Exam was run by our collaborators at CaseText and Stanford CodeX."
