A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.
When OpenAI unveiled o3 in December, the company claimed the model could answer just over a quarter of the questions on FrontierMath, a challenging set of math problems. That score blew the competition away: the next-best model managed to answer only around 2% of FrontierMath problems correctly.
“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”
As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI publicly launched last week.
Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.
OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkEy1B
— Epoch AI (@EpochAIResearch) April 18, 2025
That doesn’t mean OpenAI lied, per se. The benchmark results the company released in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.
“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [compute], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.
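To make the subset point concrete, here is a minimal arithmetic sketch. The subset sizes (180 and 290 problems) come from Epoch’s statement; the solve counts are hypothetical, chosen only to show how the same model can post very different headline accuracies on different problem sets.

```python
# Hypothetical illustration of how benchmark subset choice moves the
# headline number. Subset sizes are from Epoch's post; the solve counts
# are invented for the arithmetic and are NOT reported figures.

def accuracy(solved: int, total: int) -> float:
    """Fraction of benchmark problems answered correctly."""
    return solved / total

# frontiermath-2024-11-26: the 180-problem subset
print(f"{accuracy(solved=45, total=180):.1%}")  # 25.0%

# frontiermath-2025-02-28-private: the 290-problem subset
print(f"{accuracy(solved=29, total=290):.1%}")  # 10.0%
```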
According to a post on X from the ARC Prize Foundation, an organization that tested a prerelease version of o3, the public o3 model “is a different model [...] tuned for chat/product use,” corroborating Epoch’s report.
“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.
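One common way to see why more test-time compute lifts scores is repeated sampling: if a model gets several independent attempts at each problem and any correct attempt counts, the success rate climbs with the number of attempts. The article doesn’t say how OpenAI’s compute tiers differ internally, so this is only an assumed mechanism, sketched below.

```python
# Toy model of test-time compute scaling, ASSUMING the mechanism is
# best-of-n sampling (not confirmed for OpenAI's tiers). If a single
# attempt solves a problem with probability p, at least one of n
# independent attempts succeeds with probability 1 - (1 - p) ** n.

def pass_at_n(p_single: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds."""
    return 1 - (1 - p_single) ** n

for n in (1, 4, 16, 64):  # larger n stands in for a bigger compute tier
    print(f"n={n:>2}: {pass_at_n(0.05, n):.1%}")
# n= 1: 5.0%
# n= 4: 18.5%
# n=16: 56.0%
# n=64: 96.2%
```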
Re-testing released o3 on ARC-AGI-1 will take a day or two. Because today’s release is a materially different system, we are re-labeling our past reported results as “preview”:

o3-preview (low): 75.7%, $200/task
o3-preview (high): 87.5%, $34.4k/task

Above uses o1 pro pricing …
— Mike Knoop (@mikeknoop) April 16, 2025
OpenAI’s own Wenda Zhou, a member of the technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed versus the version of o3 demoed in December. As a result, it may exhibit benchmark “disparities,” he added.
“[W]e’ve done [optimizations] to make the [model] more cost-efficient [and] more useful in general,” Zhou said. “We still hope that — we still think that — this is a much better model [...] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”
Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.
It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.
Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.
In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.
More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.