Image Credits: Kelly Sullivan / Getty Images
One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose which they prefer. But it seems the version of Maverick that Meta deployed to LM Arena differs from the version that's widely available to developers.
As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an "experimental chat version." A chart on the official Llama website, meanwhile, discloses that Meta's LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."
As we've written about before, for various reasons, LM Arena has never been the most reliable measure of an AI model's performance. But AI companies generally haven't customized or otherwise fine-tuned their models to score better on LM Arena, or at least haven't admitted to doing so.
The problem with tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model is that it makes it challenging for developers to predict exactly how well the model will perform in particular contexts. It's also misleading. Ideally, benchmarks, woefully limited as they are, provide a snapshot of a single model's strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version seems to use a lot of emojis and give incredibly long-winded answers.
Okay Llama 4 is def a little cooked lol, what is this yap city pic.twitter.com/y3GvhbVz65

— Nathan Lambert (@natolambert) April 6, 2025
for some reason, the Llama 4 model in Arena uses a lot more Emojis

on together.ai, it seems better: pic.twitter.com/f74ODX4zTt

— Tech Dev Notes (@techdevnotes) April 6, 2025
We've reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.