This week, billionaire Elon Musk’s AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beat a number of other leading models, including from OpenAI, on benchmarks for mathematics, programming, and more.

But what do these benchmarks really tell us?

Here at TC, we often reluctantly report benchmark figures because they’re one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test for esoteric knowledge, and give aggregate scores that correlate poorly to proficiency on the tasks that most people care about.

As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3’s unveiling Monday, there’s an “urgent need for better batteries of tests and independent testing authorities.” AI companies self-report benchmark results more often than not, as Mollick alluded to, making those results even tougher to accept at face value.

“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”

There’s no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.

This debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it does induce some level of AI FOMO.

As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.

News

OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom,” no matter how challenging or controversial a topic may be.

Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, intends to build tools to “make AI work for [people’s] unique needs and goals.”

Grok 3 cometh: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.

A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 29.

AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between some 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.

Research paper of the week

OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks that range from bug fixes and feature deployments to “manager-level” technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark, suggesting that AI has quite a ways to go. It’s worth noting that the researchers didn’t benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.

Model of the week

A Chinese AI company named Stepfun has released an “open” AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese and lets users adjust the emotion and even dialect of the synthetic audio it creates, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly recently closed a funding round worth several hundred million dollars from a host of investors that include Chinese state-owned private equity firms.

Grab bag

Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle on and off long “chains of thought” for improved accuracy at the cost of some computational heft. In “reasoning” mode, DeepHermes-3 Preview, similar to other reasoning AI models, “thinks” longer for harder problems and shows its thought process to arrive at the answer.

Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.