When a company releases a new AI video generator, it's not long before someone uses it to make a video of actor Will Smith eating spaghetti.
It's become something of a meme as well as a benchmark: seeing whether a new AI video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.
Google Veo 2 has done it .
We are now eating spaghett at last. pic.twitter.com/AZO81w8JC0

— Jerrod Lew (@jerrod_lew) December 17, 2024
Will Smith and pasta is but one of several bizarre "unofficial" benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British software engineer created a platform where AIs play games like Pictionary and Connect 4 against each other.
It's not as if there aren't more academic tests of an AI's performance. So why did the weirder ones blow up?
For one, many of the industry-standard AI benchmarks don't tell the average person very much. Companies often cite their AI's ability to answer questions on Math Olympiad exams, or figure out plausible solutions to PhD-level problems. Yet most people, yours truly included, use chatbots for things like responding to emails and basic research.
Crowdsourced industry measures aren't necessarily better or more informative.
Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI models perform on particular tasks, like creating a web app or generating an image. But raters tend not to be representative (most come from AI and tech industry circles) and cast their votes based on personal, hard-to-pin-down preferences.
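To make the mechanism concrete: crowdsourced arenas like this turn head-to-head preference votes into a leaderboard. Below is a minimal sketch of an Elo-style rating update from pairwise votes. It is illustrative only; the model names, starting rating, and K-factor are assumptions, and the real leaderboard uses a more sophisticated statistical model than plain Elo.

```python
from collections import defaultdict

def update_elo(ratings, winner, loser, k=32):
    """Shift two models' ratings after one pairwise preference vote."""
    ra, rb = ratings[winner], ratings[loser]
    # Expected score of the winner under the Elo model
    expected = 1 / (1 + 10 ** ((rb - ra) / 400))
    ratings[winner] = ra + k * (1 - expected)
    ratings[loser] = rb - k * (1 - expected)

# Hypothetical vote log: (preferred model, other model)
votes = [
    ("model-a", "model-b"),
    ("model-a", "model-c"),
    ("model-b", "model-c"),
    ("model-a", "model-b"),
]

ratings = defaultdict(lambda: 1000.0)  # everyone starts at 1000
for winner, loser in votes:
    update_elo(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: -kv[1])
```

The point of the sketch is the weakness the article describes: the ranking is only as representative as the people casting the votes.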
Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don't compare a system's performance to that of the average person.
"The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless," Mollick wrote.
Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical, or even all that generalizable. Just because an AI nails the Will Smith test doesn't mean it'll generate, say, a burger well.
One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI instead of its abilities in narrow domains. That's sensible. But I have a feeling weird benchmarks aren't going away anytime soon. Not only are they entertaining (who doesn't like watching AI build Minecraft castles?), but they're easy to interpret. And as my colleague Max Zeff wrote about recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.
The only question in my mind is: which odd new benchmarks will go viral in 2025?