Image Credits: Thomas Fuller / SOPA Images / LightRocket / Getty Images
A discrepancy between first- and third-party benchmark results for OpenAI's o3 AI model is raising questions about the company's transparency and model testing practices.

When OpenAI unveiled o3 in December, the company claimed the model could answer just over a quarter of questions on FrontierMath, a challenging set of math problems. That score blew the competition away; the next-best model managed to answer only around 2% of FrontierMath problems correctly.

“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”

As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing behind it than the model OpenAI publicly launched last week.

Epoch AI, the research institute behind FrontierMath, released results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.

OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.

We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkEy1B

— Epoch AI (@EpochAIResearch) April 18, 2025

That doesn’t mean OpenAI lied, per se. The benchmark results the company released in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.

“The difference between our results and OpenAI’s might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time [compute], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private),” wrote Epoch.

According to a post on X from the ARC Prize Foundation, an organization that tested a prerelease version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.

“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.

Re-testing released o3 on ARC-AGI-1 will take a day or two. Because today’s release is a materially different system, we are re-labeling our past reported results as “preview”:

o3-preview (low): 75.7%, $200/task
o3-preview (high): 87.5%, $34.4k/task

Above uses o1 pro pricing…

— Mike Knoop (@mikeknoop) April 16, 2025

OpenAI’s own Wenda Zhou, a member of the technical staff, said during a livestream last week that the o3 in production is “more optimized for real-world use cases” and speed versus the version of o3 demoed in December. As a result, it may exhibit benchmark “disparities,” he added.

“[W]e’ve done [optimizations] to make the [model] more cost-efficient [and] more useful in general,” Zhou said. “We still hope that — we still think that — this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models.”

Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is a bit of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.

It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.

Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.

In January, Epoch was criticized for waiting to disclose funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.

More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.