Image Credits: Thomas Fuller / SOPA Images / LightRocket / Getty Images
A discrepancy between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the company's transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a quarter of questions on FrontierMath, a challenging set of math problems. That score blew the competition away; the next-best model managed to answer only around 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.

Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.

OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkEy1B

— Epoch AI (@EpochAIResearch) April 18, 2025

That doesn’t mean OpenAI lied, per se. The benchmark results the company released in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [compute], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a prerelease version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.

“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

Re-testing released o3 on ARC-AGI-1 will take a day or two. Because today’s release is a materially different system, we are re-labeling our past reported results as “preview”:

o3-preview (low): 75.7%, $200/task
o3-preview (high): 87.5%, $34.4k/task

Above uses o1 pro pricing…

— Mike Knoop (@mikeknoop) April 16, 2025

OpenAI’s own Wenda Zhou, a member of the technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed versus the version of o3 demoed in December. As a result, it may exhibit benchmark “disparities,” he added.

“[W]e’ve done [optimizations] to make the [model] more cost-efficient [and] more useful in general,” Zhou said. “We still hope that — we still think that — this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.