Not even Pokémon is safe from AI benchmarking controversy.
Last week, a post on X went viral, claiming that Google's latest Gemini model surpassed Anthropic's flagship Claude model in the original Pokémon video game trilogy. Reportedly, Gemini had reached Lavender Town in a developer's Twitch stream; Claude was stuck at Mount Moon as of late February.
Gemini is literally ahead of Claude atm in pokemon after reaching Lavender Town
119 live views only btw, incredibly underrated stream pic.twitter.com/8AvSovAI4x
— Jush (@Jush21e8) April 10, 2025
But what the post failed to mention is that Gemini had an advantage.
As users on Reddit pointed out, the developer who maintains the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, like cuttable trees. This reduces the need for Gemini to analyze screenshots before it makes gameplay decisions.
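To make that advantage concrete, here is a minimal sketch, in Python, of the kind of help a pre-labeled minimap provides; the tile names, grid, and helper function are hypothetical illustrations, not the stream developer's actual code.

```python
# Hypothetical sketch: why a pre-labeled minimap is an advantage.
# Without it, the model must infer terrain from raw pixels; with it,
# the harness hands the model a grid of already-named tiles.

from enum import Enum


class Tile(Enum):
    WALKABLE = "."
    WALL = "#"
    CUTTABLE_TREE = "T"  # passable only after using Cut
    WATER = "~"


def render_minimap(grid: list[list[Tile]]) -> str:
    """Serialize a tile grid into text the model can read verbatim,
    sparing it from classifying each tile out of a screenshot."""
    return "\n".join("".join(tile.value for tile in row) for row in grid)


# A 3x3 area around the player: the cuttable tree is labeled for free.
area = [
    [Tile.WALL, Tile.WALKABLE, Tile.WALL],
    [Tile.WALKABLE, Tile.WALKABLE, Tile.CUTTABLE_TREE],
    [Tile.WALL, Tile.WATER, Tile.WALL],
]

prompt = f"Minimap (T = cuttable tree):\n{render_minimap(area)}\nChoose a move."
print(prompt)
```

A model prompted this way is answering an easier question than one handed a raw screenshot, which is exactly the asymmetry the Reddit users flagged.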
Now, Pokémon is a semi-serious AI benchmark at best; few would argue it's a very informative test of a model's capabilities. But it is an instructive illustration of how different implementations of a benchmark can influence the results.
For example, Anthropic reported two scores for its recent Claude 3.7 Sonnet model on the benchmark SWE-bench Verified, which is designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.
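As a purely illustrative sketch (Anthropic hasn't detailed its scaffold publicly, and nothing below reflects it), you can think of the two scores as the same model run through two different harnesses; here the hypothetical scaffold simply grants extra attempts per task:

```python
# Hypothetical sketch: the same model scored under two harnesses.
# A "scaffold" is modeled here as extra attempts per task; real scaffolds
# might instead add tool use, planning steps, or output filtering.
import random

random.seed(0)


def model_attempt(task: str) -> bool:
    """Stand-in for one model attempt at a task (fixed success rate)."""
    return random.random() < 0.62


def benchmark(tasks: list[str], attempts: int) -> float:
    """Pass rate when the harness allows `attempts` tries per task."""
    solved = sum(any(model_attempt(t) for _ in range(attempts)) for t in tasks)
    return solved / len(tasks)


tasks = [f"task-{i}" for i in range(1000)]
print(f"bare harness:    {benchmark(tasks, attempts=1):.1%}")
print(f"custom scaffold: {benchmark(tasks, attempts=2):.1%}")
# The underlying model never changed; only the harness did.
```

Whatever the mechanism, the reported number is a property of the model plus the harness, not the model alone.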
More recently, Meta fine-tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.
Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters even further. That is to say, it doesn't seem likely that it'll get any easier to compare models as they're released.