A sample question from ARC-AGI-2. Image Credits: Arc Prize

Comparison of frontier AI model performance on ARC-AGI-1 and ARC-AGI-2. Image Credits: Arc Prize

The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models.

“Reasoning” AI models like OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

The ARC-AGI tests consist of puzzle-like problems where an AI has to identify visual patterns from a collection of differently colored squares and generate the correct “answer” grid. The problems were designed to force an AI to adapt to novel problems it hasn’t seen before.
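
For context, the original ARC tasks Chollet released are distributed as small JSON objects pairing example input grids with output grids, where each cell is an integer color code. The minimal Python sketch below illustrates that structure; the task data here is invented for illustration, and ARC-AGI-2’s exact schema may differ.

```python
# Illustrative only: an ARC-style task as paired input/output grids.
# Each grid is a 2D list of integer color codes; a model must infer the
# transformation from the "train" pairs and produce the held-out "test" answer.
example_task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
        {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
    ],
    "test": [
        {"input": [[3, 0], [0, 3]]},  # correct "answer" grid: [[0, 3], [3, 0]]
    ],
}

def grid_matches(predicted, expected):
    """An attempt counts only if every cell of the answer grid is correct."""
    return predicted == expected

print(grid_matches([[0, 3], [3, 0]], [[0, 3], [3, 0]]))  # True
```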

The Arc Prize Foundation had over 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right, much better than any of the models’ scores.

In a post on X, Chollet claimed ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests are aimed at evaluating whether an AI system can efficiently acquire new skills outside the data it was trained on.

Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” (extensive computing power) to find solutions. Chollet previously acknowledged this was a major flaw of ARC-AGI-1.

To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Arc Prize Foundation co-founder Greg Kamradt wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3’s performance gains on ARC-AGI-1 came with a hefty price tag.

The version of OpenAI’s o3 model, o3 (low), that was first to reach new heights on ARC-AGI-1, scoring 75.7% on the test, got a measly 4% on ARC-AGI-2 using $200 worth of computing power per task.
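
To put those numbers in perspective, the efficiency framing above reduces to rough cost-per-solved-task arithmetic. The snippet below is just that back-of-the-envelope calculation using the figures reported in this article (and the competition target cited later); it is not the foundation’s official scoring formula.

```python
# Back-of-the-envelope arithmetic using the figures cited in this article;
# illustrative only, not Arc Prize's official efficiency metric.
o3_low_accuracy = 0.04        # reported ARC-AGI-2 score for o3 (low)
o3_low_cost_per_task = 200.0  # reported compute spend per task, in USD

# Rough average spend per correctly solved task, assuming uniform cost per task.
print(f"~${o3_low_cost_per_task / o3_low_accuracy:,.0f} per solved task")  # ~$5,000

# For comparison, the Arc Prize 2025 target mentioned below: 85% at $0.42 per task.
print(f"~${0.42 / 0.85:.2f} per solved task")  # ~$0.49
```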

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to evaluate AI progress. Hugging Face’s co-founder, Thomas Wolf, recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 competition, challenging developers to achieve 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.