Anthropic used Pokémon to benchmark its newest AI model

Anthropic used Pokémon to benchmark its newest AI model. Yes, really.

In a blog post published Monday, Anthropic said that it tested its latest model, Claude 3.7 Sonnet, on the Game Boy classic Pokémon Red. The company equipped the model with basic memory, screen pixel input, and function calls to press buttons and navigate around the screen, allowing it to play Pokémon continuously.
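The setup described above (a small memory, screen pixels as input, and function calls that press buttons) can be pictured as a simple loop. The following is a minimal sketch of that idea, not Anthropic's actual harness, whose details are not given here. Everything in it is hypothetical: `ToyEmulator` stands in for a Game Boy emulator, `toy_policy` stands in for the model, and the `press_button` call format is invented for illustration.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical button set; the real harness's button names are not documented here.
BUTTONS = ("up", "down", "left", "right", "a", "b", "start", "select")


@dataclass
class ToyEmulator:
    """Hypothetical stand-in for a Game Boy emulator: a player on a small grid."""
    width: int = 5
    height: int = 5
    x: int = 0
    y: int = 0

    def screen_pixels(self):
        # Return the "screen" as rows of 0/1 pixels, with 1 marking the player.
        return [[1 if (cx, cy) == (self.x, self.y) else 0
                 for cx in range(self.width)]
                for cy in range(self.height)]

    def press(self, button):
        # Only the d-pad moves the player in this toy world; other buttons do nothing.
        dx, dy = {"left": (-1, 0), "right": (1, 0),
                  "up": (0, -1), "down": (0, 1)}.get(button, (0, 0))
        self.x = min(max(self.x + dx, 0), self.width - 1)
        self.y = min(max(self.y + dy, 0), self.height - 1)


def find_player(pixels):
    # Locate the single lit pixel on the toy screen.
    for cy, row in enumerate(pixels):
        for cx, value in enumerate(row):
            if value:
                return cx, cy
    raise ValueError("player not on screen")


def toy_policy(pixels, memory, goal):
    """Placeholder for the model: looks at the pixels and returns one function call.

    A real model could also consult `memory`; this toy policy ignores it.
    """
    x, y = find_player(pixels)
    gx, gy = goal
    if x != gx:
        button = "right" if x < gx else "left"
    elif y != gy:
        button = "down" if y < gy else "up"
    else:
        return None  # goal reached, nothing left to press
    return {"name": "press_button", "input": {"button": button}}


def run_agent(emulator, goal, max_actions=100, memory_size=8):
    memory = deque(maxlen=memory_size)  # "basic memory": keeps only the latest notes
    actions = 0
    while actions < max_actions:
        pixels = emulator.screen_pixels()        # screen pixel input
        call = toy_policy(pixels, memory, goal)  # the "model" picks a function call
        if call is None:
            break
        button = call["input"]["button"]
        if call["name"] != "press_button" or button not in BUTTONS:
            raise ValueError(f"unsupported tool call: {call}")
        emulator.press(button)                   # execute the button press
        actions += 1
        memory.append(f"action {actions}: pressed {button}")
    return actions, list(memory)


if __name__ == "__main__":
    emu = ToyEmulator()
    count, notes = run_agent(emu, goal=(3, 2))
    print(count, (emu.x, emu.y))  # 5 (3, 2)
    print(notes[-1])              # action 5: pressed down
```

The loop counts one action per button press, which is the same unit in which Anthropic reported the model's progress. The bounded `deque` is one simple way to model limited memory: older notes fall off as new ones arrive.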

A unique feature of Claude 3.7 Sonnet is its ability to engage in “extended thinking.” Like OpenAI’s o3-mini and DeepSeek’s R1, Claude 3.7 Sonnet can “reason” through challenging problems by applying more computing and taking more time.

That came in handy in Pokémon Red, evidently.

Compared to a previous version of Claude, Claude 3.0 Sonnet, which failed to leave the house in Pallet Town where the story begins, Claude 3.7 Sonnet successfully battled three Pokémon gym leaders and won their badges.

Now, it’s not clear how much computing was required for Claude 3.7 Sonnet to reach those milestones, or how long each took. Anthropic only said that the model performed 35,000 actions to reach the last gym leader, Surge.

Last week, a researcher tried out an early preview of Claude 3.7 Sonnet. The results were impressive. Within hours, Claude defeated Brock. Days later, it trounced Misty. Progress that older models had little hope of achieving. Turns out extended thinking is super effective. pic.twitter.com/RspsLgj2Uf

It surely won’t be long before some enterprising developer finds out.

Pokémon Red is more of a toy benchmark than anything. However, there is a long history of games being used for AI benchmarking purposes. In the past few months alone, a number of new apps and platforms have cropped up to test models’ game-playing abilities on titles ranging from Street Fighter to Pictionary.
