R1 getting “frustrated” on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.
The scores of the models the team tested on their benchmark. Image Credits: Guha et al.
Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.
That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.
In a recent study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, like that reasoning models — OpenAI’s o1, among others — sometimes “give up” and provide answers they know aren’t correct.
“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.
The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks — even benchmarks released relatively recently — are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased such that models can’t draw on “rote memory” to solve them, explained Guha.
“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it — that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.
“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”
On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions — typically seconds to minutes longer.
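For readers curious what scoring a model against a riddle set like this might look like in practice, here is a minimal, purely illustrative sketch in Python. The puzzle list, the ask_model function, and the substring-matching rule are assumptions made for the example, not the researchers’ actual data or evaluation harness.

```python
# Purely illustrative sketch of scoring a model on a riddle benchmark.
# PUZZLES, ask_model, and the matching rule are placeholders, not the
# researchers' actual data or harness.

PUZZLES = [
    # Each entry pairs a riddle with its accepted answer; the real
    # benchmark contains roughly 600 such items.
    {"question": "<Sunday Puzzle riddle text>", "answer": "<expected answer>"},
]

def ask_model(question: str) -> str:
    """Stand-in for a call to a reasoning model such as o1 or R1."""
    return "<model's answer>"

def accuracy(puzzles=PUZZLES) -> float:
    # Count a puzzle as solved if the accepted answer appears in the reply.
    correct = sum(
        1 for p in puzzles
        if p["answer"].lower() in ask_model(p["question"]).lower()
    )
    return correct / len(puzzles)

if __name__ == "__main__":
    print(f"accuracy: {accuracy():.0%}")  # with the dummy model above, this prints 0%
```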
At least one model, DeepSeek’s R1, gives answers it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random — behavior this human can certainly relate to.
The models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempting to tease out a better one, and failing again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.
“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”
The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help to identify areas where these models might be improved.
“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are — and aren’t — capable of.”