A high schooler built a website that lets you challenge AI models to a Minecraft build-off

As conventional AI benchmarking techniques prove inadequate, AI builders are turning to more creative ways to assess the capabilities of generative AI models. For one group of developers, that’s Minecraft, the Microsoft-owned sandbox-building game.

The website Minecraft Benchmark (or MC-Bench) was developed collaboratively to pit AI models against each other in head-to-head challenges to respond to prompts with Minecraft creations. Users can vote on which model did a better job, and only after voting can they see which AI made each Minecraft build.

For Adi Singh, the 12th-grader who started MC-Bench, the value of Minecraft isn’t so much the game itself, but the familiarity that people have with it. After all, it is the best-selling video game of all time. Even for people who haven’t played the game, it’s still possible to evaluate which blocky representation of a pineapple is better realized.

“Minecraft allows people to see the progress [of AI development] much more easily,” Singh told TechCrunch. “People are used to Minecraft, used to the look and the vibe.”

MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project’s use of their products to run benchmark prompts, per MC-Bench’s website, but the companies are not otherwise affiliated.

“Currently we are just doing simple builds to reflect on how far we’ve come from the GPT-3 era, but [we] could see ourselves scaling to these longer-form plans and goal-oriented tasks,” Singh said. “Games might just be a medium to test agentic reasoning that is safer than in real life and more controllable for testing purposes, making it more ideal in my eyes.”

Other games like Pokémon Red, Street Fighter, and Pictionary have been used as experimental benchmarks for AI, in part because the art of benchmarking AI is notoriously tricky.

Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they’re trained, models are naturally gifted at certain, narrow kinds of problem-solving, particularly problem-solving that requires rote memorization or basic extrapolation.

Put simply, it’s hard to glean what it means that OpenAI’s GPT-4 can score in the 88th percentile on the LSAT, but cannot discern how many Rs are in the word “strawberry.” Anthropic’s Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it is worse at playing Pokémon than most five-year-olds.

MC-Bench is technically a programming benchmark, since the models are asked to write code to create the prompted build, like “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.”

But it’s easier for most MC-Bench users to evaluate whether a snowman looks better than to dig into code, which gives the project wider appeal, and thus the potential to collect more data about which models consistently score better.

Whether those scores amount to much in the way of AI usefulness is up for debate, of course. Singh asserts that they’re a strong signal, though.

“The current leaderboard reflects quite closely to my own experience of using these models, which is unlike a lot of pure text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies to know if they’re heading in the right direction.”
