Image Credits: Pokémon
Not even Pokémon is safe from AI benchmarking controversy.

Last week, a post on X went viral, claiming that Google's latest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer's Twitch stream; Claude was stuck at Mount Moon as of late February.

Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town

119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x

— Jush (@Jush21e8) April 10, 2025

But what the post failed to mention is that Gemini had an advantage.

As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.
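To see why that matters, here is a minimal, hypothetical sketch of the idea (the stream's actual tool isn't public, and the tile codes and labels below are invented for illustration): a helper that converts a tile grid into plain-text annotations a language model can read directly, instead of having to interpret raw screenshots.

```python
# Hypothetical sketch: turn a minimap tile grid into text labels a
# language model can consume directly. Tile codes and labels are
# illustrative only, not taken from the actual stream's tooling.
TILE_LABELS = {
    "T": "cuttable tree",
    "W": "water",
    "#": "wall",
    ".": "walkable path",
    "P": "player",
}

def describe_minimap(grid):
    """Return lines like '(row, col): cuttable tree' for notable tiles."""
    notes = []
    for r, row in enumerate(grid):
        for c, tile in enumerate(row):
            label = TILE_LABELS.get(tile)
            if label and label != "walkable path":
                notes.append(f"({r}, {c}): {label}")
    return notes

minimap = [
    "..T",
    ".P#",
    "WW.",
]
print(describe_minimap(minimap))
# → ['(0, 2): cuttable tree', '(1, 1): player', '(1, 2): wall',
#    '(2, 0): water', '(2, 1): water']
```

A model fed these annotations skips the hardest perception step entirely, which is exactly the kind of scaffolding difference that makes two "Pokémon benchmark" runs incomparable.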

Now, Pokémon is a semi-serious AI benchmark at best; few would argue it's a very informative test of a model's capabilities. But it is an instructive illustration of how different implementations of a benchmark can influence the results.

For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.
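The size of that harness effect is easy to quantify from the two reported numbers alone (a quick back-of-the-envelope calculation, nothing more):

```python
# Back-of-the-envelope comparison of the two reported scores for the
# same model on the same benchmark, differing only in scaffolding.
baseline_pct = 62.3   # Claude 3.7 Sonnet, standard setup
scaffold_pct = 70.3   # same model with Anthropic's custom scaffold

delta = scaffold_pct - baseline_pct          # absolute gap in points
relative = delta / baseline_pct * 100        # gap relative to baseline

print(f"{delta:.1f} percentage points ({relative:.1f}% relative)")
# → 8.0 percentage points (12.8% relative)
```

An 8-point swing from scaffolding alone is larger than the margins that often separate competing frontier models on leaderboards, which is why reporting the harness alongside the score matters.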

More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.

Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn't seem likely that it'll get any easier to compare models as they're released.