Image Credits: Pokémon
Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google's latest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer's Twitch stream; Claude was stuck at Mount Moon as of late February.

Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town

119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x

— Jush (@Jush21e8) April 10, 2025

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.
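To see why that matters, here is a minimal, hypothetical sketch of the idea (the stream's actual tool isn't public, and the tile codes and labels below are invented for illustration): a helper that converts a tile grid into plain-text annotations a language model can read directly, instead of having to interpret raw screenshots.

```python
# Hypothetical sketch: turn a minimap tile grid into text labels a
# language model can consume directly. Tile codes and labels are
# illustrative only, not taken from the actual stream's tooling.
TILE_LABELS = {
    "T": "cuttable tree",
    "W": "water",
    "#": "wall",
    ".": "walkable path",
    "P": "player",
}

def describe_minimap(grid):
    """Return lines like '(row, col): cuttable tree' for notable tiles."""
    notes = []
    for r, row in enumerate(grid):
        for c, tile in enumerate(row):
            label = TILE_LABELS.get(tile)
            if label and label != "walkable path":
                notes.append(f"({r}, {c}): {label}")
    return notes

minimap = [
    "..T",
    ".P#",
    "WW.",
]
print(describe_minimap(minimap))
# → ['(0, 2): cuttable tree', '(1, 1): player', '(1, 2): wall',
#    '(2, 0): water', '(2, 1): water']
```

A model fed these annotations skips the hardest perception step entirely, which is exactly the kind of scaffolding difference that makes two "Pokémon benchmark" runs incomparable.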

Now, Pokémon is a semi-serious AI benchmark at best; few would argue it's a very informative test of a model's capabilities. But it is an instructive illustration of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.
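The size of that harness effect is easy to quantify from the two reported numbers alone (a quick back-of-the-envelope calculation, nothing more):

```python
# Back-of-the-envelope comparison of the two reported scores for the
# same model on the same benchmark, differing only in scaffolding.
baseline_pct = 62.3   # Claude 3.7 Sonnet, standard setup
scaffold_pct = 70.3   # same model with Anthropic's custom scaffold

delta = scaffold_pct - baseline_pct          # absolute gap in points
relative = delta / baseline_pct * 100        # gap relative to baseline

print(f"{delta:.1f} percentage points ({relative:.1f}% relative)")
# → 8.0 percentage points (12.8% relative)
```

An 8-point swing from scaffolding alone is larger than the margins that often separate competing frontier models on leaderboards, which is why reporting the harness alongside the score matters.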

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn't seem likely that it'll get any easier to compare models as they're released.