Image Credits: Jaap Arriens/NurPhoto / Getty Images


Debates over AI benchmarks, and how AI labs report them, are spilling out into public view.

This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. Igor Babuschkin, one of xAI's co-founders, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational math exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math abilities.

xAI's graph shows two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."

What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 attempts to answer each problem in a benchmark and takes the answers it generated most frequently as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph might make it appear as though one model surpasses another when in reality that isn't the case.
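The consensus scoring described above amounts to a majority vote over repeated samples. Here is a minimal sketch of the idea; it is illustrative only, not xAI's or OpenAI's actual evaluation code, and the `model` callable and helper names are hypothetical:

```python
from collections import Counter


def consensus_answer(sample_answers):
    """Return the most frequent answer among the sampled attempts."""
    return Counter(sample_answers).most_common(1)[0][0]


def score_benchmark(problems, model, k=64):
    """Score a model with consensus@k voting.

    problems: list of (question, gold_answer) pairs.
    model: hypothetical callable that returns one answer per attempt.
    """
    correct = 0
    for question, gold in problems:
        attempts = [model(question) for _ in range(k)]  # sample k answers
        if consensus_answer(attempts) == gold:
            correct += 1
    return correct / len(problems)
```

A "@1" score, by contrast, would grade only the first sampled answer per problem, which is why the two numbers can diverge so widely.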

Grok 3 Reasoning Beta and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (meaning the first score the models got on the benchmark) fell below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."


Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:

Hilarious how some people see my plot as an attack on OpenAI and others as an attack on Grok while in reality it's DeepSeek propaganda (I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-high-pass@"1" deserves more scrutiny.) https://t.co/dJqlJpcJh8 pic.twitter.com/3WH8FOUfic

— Teortaxes ▶️ (DeepSeek 推特 🐋 铁粉 2023 – ∞) (@teortaxesTex) February 20, 2025

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.