Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. Igor Babuschkin, one of xAI's co-founders, insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational math exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark, taking the answer generated most frequently as the final answer. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality, that isn't the case.
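To make the mechanic concrete, here is a minimal sketch of how a consensus@64 score can be computed. The `fake_model` callable and its scripted answers are hypothetical, purely for illustration; real evaluations sample a model 64 times per problem and check the majority answer against the reference solution.

```python
from collections import Counter

def consensus_at_k(generate, problem, k=64):
    """Majority-vote scoring: sample k answers to one problem,
    return the most frequent answer as the model's final answer."""
    answers = [generate(problem) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

# Hypothetical "model": right only ~60% of the time, scripted so the
# example is deterministic.
scripted = iter(["42"] * 40 + ["41"] * 14 + ["43"] * 10)
fake_model = lambda problem: next(scripted)

best = consensus_at_k(fake_model, "AIME problem 1")
# best == "42": the majority answer wins even though 24 of the
# 64 attempts were wrong.
```

This is why cons@64 flatters a model relative to @1 (a single attempt): one sample from this fake model is wrong more than a third of the time, but the vote over 64 samples lands on the majority answer.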
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (meaning the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
Hilarious how some people see my plot as attack on OpenAI and others as attack on Grok while in reality it's DeepSeek propaganda (I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-high-pass@"1" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes ▶️ (DeepSeek 推特 🐋 铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.