Tom's Hardware

Chinese AI models storm Hugging Face's LLM chatbot benchmark leaderboard — Alibaba runs the board as major US competitors have worsened

By Dallin Grimm

7 days ago

Hugging Face has released its second LLM leaderboard to rank the best language models it has tested. The new leaderboard seeks to be a more challenging uniform standard for testing open large language model (LLM) performance across a variety of tasks. Alibaba's Qwen models appear dominant in the leaderboard's inaugural rankings, taking three spots in the top ten.

Hugging Face's second leaderboard tests language models across four tasks: knowledge testing, reasoning on extremely long contexts, complex math abilities, and instruction following. Six benchmarks are used to test these qualities, with tests including solving 1,000-word murder mysteries, explaining PhD-level questions in layman's terms, and most daunting of all: high-school math equations. A full breakdown of the benchmarks used can be found on Hugging Face's blog.

The frontrunner of the new leaderboard is Qwen, Alibaba's LLM, which takes 1st, 3rd, and 10th place with its handful of variants. Also showing up are Llama3-70B, Meta's LLM, and a handful of smaller open-source projects that managed to outperform the pack. Notably absent is any sign of ChatGPT; to ensure reproducibility of results, Hugging Face's leaderboard does not test closed-source models.

Tests to qualify on the leaderboard are run exclusively on Hugging Face's own computers, which, according to CEO Clem Delangue on Twitter, are powered by 300 Nvidia H100 GPUs. Because of Hugging Face's open-source and collaborative nature, anyone is free to submit new models for testing and admission on the leaderboard, with a new voting system prioritizing popular new entries for testing. The leaderboard can be filtered to show only a highlighted array of significant models to avoid a confusing glut of small LLMs.

As a pillar of the LLM space, Hugging Face has become a trusted source for LLM learning and community collaboration. After its first leaderboard was released last year as a means to compare and reproduce testing results from several established LLMs, the board quickly took off in popularity. Getting high ranks on the board became the goal of many developers, small and large. As models became generally stronger, 'smarter,' and optimized for the specific tests of the first leaderboard, its results became less and less meaningful, hence the creation of a second variant.

Some LLMs, including newer variants of Meta's Llama, severely underperformed in the new leaderboard compared to their high marks in the first. This came from a trend of over-training LLMs only on the first leaderboard's benchmarks, leading to regressions in real-world performance. This regression, driven by hyperspecific and self-referential data, follows a trend of AI performance growing worse over time, proving once again, as Google's AI answers have shown, that LLM performance is only as good as its training data and that true artificial "intelligence" is still many, many years away.
