As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that's Minecraft, the Microsoft-owned sandbox-building game.
The website Minecraft Benchmark (or MC-Bench) was developed collaboratively to pit AI models against each other in head-to-head challenges to respond to prompts with Minecraft creations. Users can vote on which model did a better job, and only after voting can they see which AI made each Minecraft build.
For Adi Singh, the 12th grader who started MC-Bench, the value of Minecraft isn't so much the game itself as the familiarity people have with it. After all, it is the best-selling video game of all time. Even for people who haven't played the game, it's still possible to evaluate which blocky representation of a pineapple is better realized.
“Minecraft allows people to see the progress [of AI development] much more easily,” Singh told TechCrunch. “People are used to Minecraft, used to the look and the vibe.”
MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project's use of their products to run benchmark prompts, per MC-Bench's website, but the companies are not otherwise affiliated.
“Currently we are just doing simple builds to reflect on how far we've come from the GPT-3 era, but [we] could see ourselves scaling to these longer-form plans and goal-oriented tasks,” Singh said. “Games might just be a medium to test agentic reasoning that is safer than in real life and more controllable for testing purposes, making it more ideal in my eyes.”
Other games like Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is notoriously tricky.
Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they're trained, models are naturally gifted at certain, narrow kinds of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.
Put simply, it's hard to glean what it means that OpenAI's GPT-4 can score in the 88th percentile on the LSAT, but cannot discern how many Rs are in the word “strawberry.” Anthropic's Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pokémon than most five-year-olds.
MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, like “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.”
But it's easier for most MC-Bench users to evaluate whether a snowman looks better than to dig into code, which gives the project wider appeal, and thus the potential to collect more data about which models consistently score better.
Whether those scores amount to much in the way of AI usefulness is up for debate, of course. Singh asserts that they're a strong signal, though.
“The current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies to know if they're heading in the right direction.”