R1 getting “frustrated” on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.

The scores of the models the team tested on their benchmark. Image Credits: Guha et al.


Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.

That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.

In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, such as that reasoning models, OpenAI’s o1 among others, sometimes “give up” and provide answers they know aren’t correct.

“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks, even benchmarks released relatively recently, are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased such that models can’t draw on “rote memory” to solve them, explained Guha.

“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it; that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”


No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.
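To give a concrete, if simplified, sense of how accuracy on a benchmark like this is usually computed, here is a minimal sketch of an exact-match scoring loop in Python. The sample riddles, the dummy_model stand-in, and the normalization rule are illustrative assumptions, not the researchers’ actual evaluation harness.

# Hypothetical sketch: exact-match accuracy over a small riddle set.
# The sample puzzles, dummy_model(), and normalization rule are
# made-up illustrations, not the study's actual benchmark code.

def normalize(text: str) -> str:
    """Lowercase and drop non-alphanumerics so capitalization and
    punctuation differences don't count as wrong answers."""
    return "".join(ch for ch in text.lower() if ch.isalnum())

def score(puzzles, answer_fn) -> float:
    """Return exact-match accuracy of answer_fn over a list of
    {'question': ..., 'answer': ...} items."""
    correct = sum(
        normalize(answer_fn(p["question"])) == normalize(p["answer"])
        for p in puzzles
    )
    return correct / len(puzzles)

if __name__ == "__main__":
    sample = [  # invented examples, not real Sunday Puzzle questions
        {"question": "Name a fruit whose name is also a color.", "answer": "orange"},
        {"question": "Name a fruit whose name is also a first name.", "answer": "clementine"},
    ]
    dummy_model = lambda q: "Orange."  # stand-in for a real model call
    print(f"accuracy: {score(sample, dummy_model):.0%}")  # prints 50%

In practice, the hard part is not this loop but deciding when a free-form model response counts as correct, which is why benchmarks built on riddles with short, unambiguous answers are attractive.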

At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random, behavior this human can certainly relate to.

The models make other bizarre choices, too, like giving a wrong answer only to retract it immediately, trying to tease out a better one, and failing again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was curious to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren’t, capable of.”