Generative AI models don't process text the same way humans do. Understanding their "token"-based internal environments may help explain some of their strange behaviors, and their stubborn limitations.
Most models, from small on-device ones like Gemma to OpenAI's industry-leading GPT-4o, are built on an architecture known as the transformer. Due to the way transformers conjure up associations between text and other types of data, they can't take in or output raw text, at least not without a massive amount of compute.
So, for reasons both practical and technical, today's transformer models work with text that's been broken down into smaller, bite-sized pieces called tokens, a process known as tokenization.
Tokens can be words, like "fantastic." Or they can be syllables, like "fan," "tas" and "tic." Depending on the tokenizer (the model that does the tokenizing), they might even be individual characters in words (e.g., "f," "a," "n," "t," "a," "s," "t," "i," "c").
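To make this concrete, here's a minimal sketch using tiktoken, OpenAI's open source tokenizer library. The exact splits depend entirely on the tokenizer's learned vocabulary, so treat the output as illustrative rather than universal:

```python
# A minimal sketch using OpenAI's open source tiktoken library
# (pip install tiktoken). Splits vary by tokenizer vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4

ids = enc.encode("fantastic")
print(ids)                             # integer token IDs
print([enc.decode([i]) for i in ids])  # how the word was chunked
```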
Using this method, transformers can take in more information (in the semantic sense) before they reach an upper limit known as the context window. But tokenization can also introduce biases.
Some tokens have odd spacing, which can derail a transformer. A tokenizer might encode "once upon a time" as "once," "upon," "a," "time," for example, while encoding "once upon a " (which has a trailing whitespace) as "once," "upon," "a," " ." Depending on how a model is prompted, with "once upon a" or "once upon a ", the results may be completely different, because the model doesn't understand (as a person would) that the meaning is the same.
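Here's a hedged sketch of that whitespace sensitivity, again using tiktoken as a stand-in for whatever tokenizer a given model actually uses:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

without_space = enc.encode("once upon a")
with_space = enc.encode("once upon a ")  # trailing whitespace

# A human reads these as the same prompt, but the model may receive
# different token sequences, and so may respond differently.
print(without_space)
print(with_space)
print(without_space == with_space)  # typically False
```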
Tokenizers treat case differently, too. "Hello" isn't necessarily the same as "HELLO" to a model; "hello" is usually one token (depending on the tokenizer), while "HELLO" can be as many as three ("HE," "El" and "O"). That's why many transformers fail the capital letter test.
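A quick way to see that case sensitivity for yourself, with the caveat that counts differ from tokenizer to tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ("hello", "Hello", "HELLO"):
    ids = enc.encode(word)
    # Uppercase strings often split into more pieces because they
    # appear less frequently in the tokenizer's training data.
    print(word, len(ids), [enc.decode([i]) for i in ids])
```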
"It's kind of hard to get around the question of what exactly a 'word' should be for a language model, and even if we got human experts to agree on a perfect token vocabulary, models would probably still find it useful to 'chunk' things even further," Sheridan Feucht, a PhD student studying large language model interpretability at Northeastern University, told TechCrunch. "My guess would be that there's no such thing as a perfect tokenizer due to this kind of fuzziness."
This "fuzziness" creates even more problems in languages other than English.
Many tokenization methods assume that a space in a sentence denotes a new word. That's because they were designed with English in mind. But not all languages use spaces to separate words. Chinese and Japanese don't, and neither do Korean, Thai or Khmer.
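The space-delimited assumption is easy to see with plain string splitting; the Japanese sentence below (a hypothetical example meaning roughly "once upon a time, in a certain place") comes back as one unbroken "word":

```python
# Naive whitespace splitting, the assumption baked into many
# English-first tokenization pipelines.
english = "once upon a time"
japanese = "むかしむかしあるところに"  # no spaces between words

print(english.split())   # ['once', 'upon', 'a', 'time']
print(japanese.split())  # one element: the whole sentence
```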
A 2023 Oxford study found that, because of differences in the way non-English languages are tokenized, it can take a transformer twice as long to complete a task phrased in a non-English language versus the same task phrased in English. The same study, and another, found that users of less "token-efficient" languages are likely to see worse model performance yet pay more for usage, given that many AI vendors charge per token.
Tokenizers often treat each character in logographic systems of writing (systems in which printed symbols represent words without relating to pronunciation, like Chinese) as a distinct token, leading to high token counts. Similarly, tokenizers processing agglutinative languages (languages where words are made up of small meaningful word elements called morphemes, such as Turkish) tend to turn each morpheme into a token, increasing overall token counts. (The equivalent word for "hello" in Thai, สวัสดี, is six tokens.)
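You can check token counts yourself; the sketch below uses tiktoken, and the exact numbers will vary by tokenizer:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The same greeting in three scripts. Non-Latin scripts often cost
# more tokens per word; exact counts depend on the vocabulary.
for greeting in ("hello", "สวัสดี", "你好"):
    print(greeting, "->", len(enc.encode(greeting)), "tokens")
```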
In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing the tokenization of different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages needed up to 10 times more tokens to capture the same meaning as in English.
Beyond language inequities, tokenization might explain why today's models are bad at math.
Rarely are digits tokenized consistently. Because they don't really know what numbers are, tokenizers might treat "380" as one token, but represent "381" as a pair ("38" and "1"), effectively destroying the relationships between digits and results in equations and formulas. The result is transformer confusion; a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (See: GPT-4 thinks 7,735 is greater than 7,926.)
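The sketch below prints how one tokenizer happens to split nearby numbers; other tokenizers split them differently, which is precisely the problem:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Nearby numbers can land on very different token boundaries,
# erasing the digit-level structure a human sees immediately.
for number in ("380", "381", "7735", "7926"):
    ids = enc.encode(number)
    print(number, "->", [enc.decode([i]) for i in ids])
```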
That's also the reason models aren't great at solving anagram problems or reversing words.
We will see that a lot of weird behaviors and problems of LLMs actually trace back to tokenization. We'll go through a number of these issues, discuss why tokenization is at fault, and why someone out there ideally finds a way to delete this stage entirely. pic.twitter.com/5haV7FvbBx
So, tokenization clearly presents challenges for generative AI. Can they be solved?
Maybe.
Feucht points to "byte-level" state space models like MambaByte, which can ingest far more data than transformers without a performance penalty by doing away with tokenization entirely. MambaByte, which works directly with raw bytes representing text and other data, is competitive with some transformer models on language-analyzing tasks while better handling "noise" like words with swapped characters, spacing and capitalized characters.
Models like MambaByte are in the early research stages, however.
"It's probably best to let models look at characters directly without imposing tokenization, but right now that's just computationally infeasible for transformers," Feucht said. "For transformer models in particular, computation scales quadratically with sequence length, and so we really want to use short text representations."
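A back-of-the-envelope illustration of why: self-attention compares every position in a sequence with every other, so cost grows with the square of sequence length. (The roughly four characters per token figure below is a common rule of thumb for English, not a property of any particular model.)

```python
text = "Generative AI models don't process text the same way humans do."

chars = len(text)        # character-level sequence length
tokens = chars // 4      # assumption: ~4 characters per token in English

print(f"character-level: ~{chars**2:,} pairwise attention comparisons")
print(f"token-level:     ~{tokens**2:,} pairwise attention comparisons")
```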
Barring a tokenization breakthrough, it seems new model architectures will be the key.