Image Credits: Kelly Sullivan / Getty Images
One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose which they prefer. But it seems the version of Maverick that Meta deployed to LM Arena differs from the version that's widely available to developers.
As several AI researchers pointed out on X, Meta noted in its announcement that the Maverick on LM Arena is an "experimental chat version." A chart on the official Llama website, meanwhile, discloses that Meta's LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."
As we've written about before, for various reasons, LM Arena has never been the most reliable measure of an AI model's performance. But AI companies generally haven't customized or otherwise fine-tuned their models to score better on LM Arena, or at least haven't admitted to doing so.
The problem with tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model is that it makes it challenging for developers to predict exactly how well the model will perform in particular contexts. It's also misleading. Ideally, benchmarks, woefully limited as they are, provide a snapshot of a single model's strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version seems to use a lot of emojis and give incredibly long-winded answers.
Okay Llama 4 is def a little cooked lol, what is this yap city pic.twitter.com/y3GvhbVz65

— Nathan Lambert (@natolambert) April 6, 2025
for some reason, the Llama 4 model in Arena uses a lot more Emojis

on together.ai, it seems better: pic.twitter.com/f74ODX4zTt

— Tech Dev Notes (@techdevnotes) April 6, 2025
We've reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.