This week, billionaire Elon Musk's AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company's Grok chatbot apps. Trained on around 200,000 GPUs, the model beat a number of other leading models, including some from OpenAI, on benchmarks for math, programming, and more.
But what do these benchmarks really tell us?
Here at TC, we often reluctantly report benchmark figures because they're one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test for esoteric knowledge, and give aggregate scores that correlate poorly to proficiency on the tasks that most people care about.
As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3's unveiling Monday, there's an "urgent need for better batteries of tests and independent testing authorities." AI companies self-report benchmark results more often than not, as Mollick noted, making those results even tougher to take at face value.
"Public benchmarks are both 'meh' and saturated, leaving a lot of AI testing to be like food reviews, based on taste," Mollick wrote. "If AI is critical to work, we need more."
There's no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmark.
This debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it does induce some level of AI FOMO.
As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.
News
OpenAI tries to "uncensor" ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace "intellectual freedom," no matter how challenging or controversial a topic may be.
Mira's new startup: Former OpenAI CTO Mira Murati's new startup, Thinking Machines Lab, intends to build tools to "make AI work for [people's] unique needs and goals."
Grok 3 cometh: Elon Musk's AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.
A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Named LlamaCon after Meta's Llama family of generative AI models, the conference is scheduled for April 29.
AI and Europe's digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between some 20 organizations to build "a series of foundation models for transparent AI in Europe" that preserves the "linguistic and cultural diversity" of all EU languages.
Research paper of the week
OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks that range from bug fixes and feature deployments to "manager-level" technical implementation proposals.
According to OpenAI, the best-performing AI model, Anthropic's Claude 3.5 Sonnet, scored 40.3% on the full SWE-Lancer benchmark, suggesting that AI has quite a ways to go. It's worth noting that the researchers didn't benchmark newer models like OpenAI's o3-mini or Chinese AI company DeepSeek's R1.
Model of the week
A Chinese AI company named Stepfun has released an "open" AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese and lets users adjust the emotion and even the dialect of the synthetic audio it creates, including singing.
Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly recently closed a funding round worth several hundred million dollars from a host of investors that include Chinese state-owned private equity firms.
Grab bag
Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and "intuitive language model capabilities."
The model, DeepHermes-3 Preview, can toggle long "chains of thought" on and off for improved accuracy at the cost of some computational heft. In "reasoning" mode, DeepHermes-3 Preview, similar to other reasoning AI models, "thinks" longer on harder problems and shows its thought process to arrive at the answer.
Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.