Will Smith spaghetti AI video. Image Credits: @kimmonismus

LLM Pictionary. Image Credits: Paul Calcraft

The Chatbot Arena interface. Image Credits: LMSYS

Mcbench. Note the typo; there’s no such model as Claude 3.6 Sonnet. Image Credits: Adonis Singh

When a company releases a new AI video generator, it’s not long before someone uses it to make a video of actor Will Smith eating spaghetti.

It’s become something of a meme as well as a benchmark: seeing whether a new video generator can realistically render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.

Google Veo 2 has done it.

We are now eating spaghett at last. pic.twitter.com/AZO81w8JC0

- Jerrod Lew (@jerrod_lew), December 17, 2024

Will Smith and pasta is but one of several bizarre “unofficial” benchmarks to take the AI community by storm in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British software engineer created a platform where AI models play games like Pictionary and Connect 4 against each other.

It’s not like there aren’t more academic tests of an AI’s performance. So why did the weirder ones blow up?

For one, many of the industry-standard AI benchmarks don’t tell the average person very much. Companies often cite their AI’s ability to answer questions on Math Olympiad exams, or figure out plausible solutions to PhD-level problems. Yet most people, yours truly included, use chatbots for things like responding to emails and basic research.

Crowdsourced industry measures aren’t necessarily better or more informative.

Take, for example, Chatbot Arena, a public benchmark many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI performs on particular tasks, like creating a web app or generating an image. But raters tend not to be representative (most come from AI and tech industry circles) and cast their votes based on personal, hard-to-pin-down preferences.

Ethan Mollick, a professor of management at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don’t compare a system’s performance to that of the average person.

“The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless,” Mollick wrote.

Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are most certainly not empirical, or even all that generalizable. Just because an AI nails the Will Smith test doesn’t mean it’ll generate, say, a burger well.

One expert I spoke to about AI benchmarks suggested that the AI community focus on the downstream impacts of AI instead of its ability in narrow domains. That’s sensible. But I have a feeling that weird benchmarks aren’t going away anytime soon. Not only are they entertaining (who doesn’t like watching AI build Minecraft castles?), but they’re easy to understand. And as my colleague Max Zeff wrote about recently, the industry continues to grapple with distilling a technology as complex as AI into digestible marketing.

The only question in my mind is, which odd new benchmarks will go viral in 2025?