R1 getting “frustrated” on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.

The scores of the models the team tested on their benchmark. Image Credits: Guha et al.


Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.

That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.

In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, such as that reasoning models, OpenAI’s o1 among others, sometimes “give up” and provide answers they know aren’t correct.

“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks, even benchmarks released relatively recently, are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased such that models can’t draw on “rote memory” to solve them, explained Guha.

“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it; that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”


No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
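To give a concrete, if simplified, sense of how accuracy on a benchmark like this is usually computed, here is a minimal sketch of an exact-match scoring loop in Python. The sample riddles, the dummy_model stand-in, and the normalization rule are illustrative assumptions, not the researchers’ actual evaluation harness.

# Hypothetical sketch: exact-match accuracy over a small riddle set.
# The sample puzzles, dummy_model(), and normalization rule are
# made-up illustrations, not the study's actual benchmark code.

def normalize(text: str) -> str:
    """Lowercase and drop non-alphanumerics so capitalization and
    punctuation differences don't count as wrong answers."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def score(puzzles, answer_fn) -> float:
    """Return exact-match accuracy of answer_fn over a list of
    {'question': ..., 'answer': ...} items."""
    correct = sum(
        normalize(answer_fn(p["question"])) == normalize(p["answer"])
        for p in puzzles
    )
    return correct / len(puzzles)

if __name__ == "__main__":
    sample = [  # invented examples, not real Sunday Puzzle questions
        {"question": "Name a fruit whose name is also a color.", "answer": "orange"},
        {"question": "Name a fruit whose name is also a first name.", "answer": "clementine"},
    ]
    dummy_model = lambda q: "Orange."  # stand-in for a real model call
    print(f"accuracy: {score(sample, dummy_model):.0%}")  # prints 50%

In practice, the hard part is not this loop but deciding when a free-form model response counts as correct, which is why benchmarks built on riddles with short, unambiguous answers are attractive.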

At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random, behavior this human can certainly relate to.

The models make other bizarre choices, too, like giving a wrong answer only to retract it immediately, trying to tease out a better one, and failing again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was curious to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren’t, capable of.”