The process of breaking text into smaller units called tokens (words, subwords, or characters) that serve as the fundamental inputs for language models and NLP systems.
In Depth
Tokenization is the first step in how language models process text. Before any neural network computation, raw text must be split into discrete units called tokens. Early NLP systems tokenized text into whole words, but modern language models use subword tokenization algorithms such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, which split text into pieces drawn from a vocabulary of frequently occurring character sequences. Common words like 'the' become single tokens, while rare words like 'tokenization' might be split into 'token' + 'ization.'
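To make the merge-based approach concrete, here is a minimal sketch of BPE vocabulary learning in Python. The toy corpus and number of merges are illustrative assumptions, and real implementations add details (such as end-of-word markers and byte-level fallback) that are omitted here:

```python
# Minimal sketch of BPE vocabulary learning: repeatedly merge the most
# frequent adjacent symbol pair. Toy corpus is an illustrative assumption.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; each word starts as a sequence of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

for step in range(10):  # learn 10 merges
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    words = merge_pair(best, words)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {best[0] + best[1]}")
```

Running this on the toy corpus shows frequent pairs like 'e' + 's' merging first, gradually building larger subword units out of characters.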
Subword tokenization elegantly solves the vocabulary problem. A word-level tokenizer would need an impossibly large vocabulary to handle every word, including misspellings, neologisms, technical terms, and compound words in languages like German. A character-level tokenizer would produce very long sequences, making processing expensive. Subword tokenization strikes a balance: the vocabulary is compact (typically 30,000-100,000 tokens), common words are kept whole, and rare words are composed from known subword pieces, ensuring the model can handle any input text.
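The composition of rare words from known pieces can be shown with a greedy longest-match segmenter, similar in spirit to how WordPiece applies a learned vocabulary. The tiny vocabulary and character fallback below are illustrative assumptions, not any real model's vocabulary:

```python
# Sketch of greedy longest-match subword segmentation: common words stay
# whole, rare words are composed from known pieces. VOCAB is a made-up
# assumption; real vocabularies hold tens of thousands of entries.
VOCAB = {"the", "token", "ization", "iz", "ation"}

def segment(word, vocab):
    """Greedily take the longest vocabulary entry matching at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: fall back to itself
            i += 1
    return pieces

print(segment("the", VOCAB))           # ['the'] -- common word stays whole
print(segment("tokenization", VOCAB))  # ['token', 'ization'] -- rare word split
```

Because every character can fall back to itself (or, in byte-level tokenizers, to a byte), no input is ever out of vocabulary.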
Tokenization has direct practical consequences. Model costs are often measured in tokens (not words), and the same text requires different numbers of tokens depending on the tokenizer; GPT-4's tokenizer differs from Claude's. Code, non-English languages, and mathematical notation often tokenize inefficiently, requiring more tokens per character. A model's context window (e.g., 128K tokens) is likewise denominated in tokens, so tokenization efficiency determines how much text fits in a single prompt. Understanding tokenization is therefore essential for optimizing both cost and performance when working with LLMs.
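Since costs and context limits are counted in tokens, measuring token counts before sending a prompt is a common practical step. A minimal sketch using the open-source tiktoken library (pip install tiktoken); the choice of the "cl100k_base" encoding is one option among tiktoken's built-ins, and other tokenizers will produce different counts for the same text:

```python
# Count tokens for a few strings, assuming the tiktoken library is installed.
# Exact counts depend on the encoding; different models use different tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["the quick brown fox", "def f(x): return x**2", "数学は楽しい"]:
    tokens = enc.encode(text)
    print(f"{len(text):3d} chars -> {len(tokens):2d} tokens: {text!r}")
```

The character-to-token ratio varies sharply across the three strings, which is exactly why code and non-English text consume context-window budget faster than plain English prose.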
Tokenization converts text into the subword units that language models actually process — it determines vocabulary coverage, processing efficiency, and cost, making it a practical concern for anyone using LLMs.