Image Credits: Jakub Porzycki / NurPhoto / Getty Images
OpenAI on Monday launched a new family of models called GPT-4.1. Yes, "4.1," as if the company's nomenclature wasn't confusing enough already.
There's GPT-4.1, GPT-4.1 mini, and GPT-4.1 nano, all of which OpenAI says "excel" at coding and instruction following. Available through OpenAI's API but not ChatGPT, the multimodal models have a 1-million-token context window, meaning they can take in roughly 750,000 words in one go (longer than "War and Peace").
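To put the 1-million-token figure in perspective, a quick Python sketch can estimate whether a text fits in the window. The 0.75-words-per-token ratio is the common rule of thumb implied by the article's own numbers (1 million tokens ≈ 750,000 words), not an official tokenizer count, and the word count for "War and Peace" is an approximate figure for the English translation:

```python
# Rough token estimate from a word count, assuming ~0.75 words per token
# (a rule of thumb; actual counts depend on the tokenizer and the text).
WORDS_PER_TOKEN = 0.75
CONTEXT_WINDOW = 1_000_000  # tokens, per OpenAI's GPT-4.1 announcement

def estimated_tokens(word_count: int) -> int:
    """Approximate the number of tokens needed for `word_count` words."""
    return round(word_count / WORDS_PER_TOKEN)

def fits_in_context(word_count: int) -> bool:
    """Check whether a text of `word_count` words fits in one request."""
    return estimated_tokens(word_count) <= CONTEXT_WINDOW

# "War and Peace" runs to roughly 587,000 words in English translation.
print(fits_in_context(587_000))   # True
print(estimated_tokens(750_000))  # 1000000 -- the full window
```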
GPT-4.1 arrives as OpenAI rivals like Google and Anthropic ratchet up efforts to build sophisticated programming models. Google recently released Gemini 2.5 Pro, which also has a 1-million-token context window and ranks highly on popular coding benchmarks. So do Anthropic's Claude 3.7 Sonnet and Chinese AI startup DeepSeek's upgraded V3.
It's the goal of many tech giants, including OpenAI, to train AI coding models capable of performing complex software engineering tasks. OpenAI's grand ambition is to create an "agentic software engineer," as CFO Sarah Friar put it during a tech summit in London last month. The company asserts its future models will be able to program entire apps end-to-end, handling aspects such as quality assurance, bug testing, and documentation writing.
GPT-4.1 is a step in this direction.
"We've optimized GPT-4.1 for real-world use based on direct feedback to improve in areas that developers care most about: frontend coding, making fewer extraneous edits, following formats reliably, adhering to response structure and ordering, consistent tool usage, and more," an OpenAI spokesperson told TechCrunch via email. "These improvements enable developers to build agents that are considerably better at real-world software engineering tasks."
OpenAI claims the full GPT-4.1 model outperforms its GPT-4o and GPT-4o mini models on coding benchmarks, including SWE-bench. GPT-4.1 mini and nano are said to be more efficient and faster at the cost of some accuracy, with OpenAI saying GPT-4.1 nano is its fastest and cheapest model ever.
GPT-4.1 costs $2 per million input tokens and $8 per million output tokens. GPT-4.1 mini is $0.40 per million input tokens and $1.60 per million output tokens, and GPT-4.1 nano is $0.10 per million input tokens and $0.40 per million output tokens.
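At those rates, the cost of a single request is straightforward arithmetic. A minimal sketch, with the per-million prices hard-coded from the figures above; the sample request sizes are illustrative, not from the article:

```python
# Per-million-token prices (USD) as announced for the GPT-4.1 family.
PRICES = {
    "gpt-4.1":      {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-million rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 100,000-token prompt with a 5,000-token reply.
print(f"${request_cost('gpt-4.1', 100_000, 5_000):.2f}")       # $0.24
print(f"${request_cost('gpt-4.1-nano', 100_000, 5_000):.4f}")  # $0.0120
```

The spread is stark: the same request costs twenty times less on nano than on the full model, which is presumably the point of offering the smaller tiers.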
According to OpenAI's internal testing, GPT-4.1, which can generate more tokens at once than GPT-4o (32,768 versus 16,384), scored between 52% and 54.6% on SWE-bench Verified, a human-validated subset of SWE-bench. (OpenAI noted in a blog post that some solutions to SWE-bench Verified problems couldn't run on its infrastructure, hence the range of scores.) Those figures are slightly under the scores reported by Google and Anthropic for Gemini 2.5 Pro (63.8%) and Claude 3.7 Sonnet (62.3%), respectively, on the same benchmark.
In a separate evaluation, OpenAI probed GPT-4.1 using Video-MME, which is designed to measure the ability of a model to "understand" content in videos. GPT-4.1 reached a chart-topping 72% accuracy on the "long, no subtitles" video category, claims OpenAI.
While GPT-4.1 scores reasonably well on benchmarks and has a more recent "knowledge cutoff," giving it a better frame of reference for current events (up to June 2024), it's important to keep in mind that even some of the best models today struggle with tasks that wouldn't trip up experts. For example, many studies have shown that code-generating models often fail to fix, and even introduce, security vulnerabilities and bugs.
OpenAI acknowledges, too, that GPT-4.1 becomes less reliable (i.e., likelier to make mistakes) the more input tokens it has to deal with. On one of the company's own tests, OpenAI-MRCR, the model's accuracy decreased from around 84% with 8,000 tokens to 50% with 1 million tokens. GPT-4.1 also tends to be more "literal" than GPT-4o, says the company, sometimes requiring more specific, explicit prompting.