This Week in AI: Maybe we should ignore AI benchmarks for now


This week, billionaire Elon Musk’s AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beat a number of other leading models, including from OpenAI, on benchmarks for mathematics, programming, and more.

But what do these benchmarks really tell us?

Here at TC, we often reluctantly report benchmark figures because they’re one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test for esoteric knowledge, and give aggregate scores that correlate poorly to proficiency on the tasks that most people care about.

As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3’s unveiling Monday, there’s an “urgent need for better batteries of tests and independent testing authorities.” AI companies self-report benchmark results more often than not, as Mollick alluded to, making those results even tougher to accept at face value.

“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”

There’s no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.


This debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it does induce some level of AI FOMO.

As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.

News

OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom,” no matter how challenging or controversial a topic may be.

Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, intends to build tools to “make AI work for [people’s] unique needs and goals.”

Grok 3 cometh: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.

A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 29.

AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between some 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.

Research paper of the week

OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks that range from bug fixes and feature deployments to “manager-level” technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, suggesting that AI has quite a ways to go. It’s worth noting that the researchers didn’t benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.

Model of the week

A Chinese AI company named Stepfun has released an “open” AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese and lets users adjust the emotion and even dialect of the synthetic audio it creates, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly recently closed a funding round worth several hundred million dollars from a host of investors that include Chinese state-owned private equity firms.

Grab bag

Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle on and off long “chains of thought” for improved accuracy at the cost of some computational heft. In “reasoning” mode, DeepHermes-3 Preview, similar to other reasoning AI models, “thinks” longer for harder problems and shows its thought process to arrive at the answer.

Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.
