The process of breaking text into smaller units called tokens (words, subwords, or characters) that serve as the fundamental inputs for language models and NLP systems.
In Depth
Tokenization is the first step in how language models process text. Before any neural network computation, raw text must be split into discrete units called tokens. Early NLP systems tokenized text into whole words, but modern language models use subword tokenization algorithms such as Byte-Pair Encoding (BPE), WordPiece, and SentencePiece, which split text into pieces drawn from a vocabulary of frequently occurring character sequences. Common words like 'the' become single tokens, while rare words like 'tokenization' might be split into 'token' + 'ization.'
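To make the merge-based approach concrete, here is a minimal sketch of BPE vocabulary learning in Python. The toy corpus and number of merges are illustrative assumptions, and real implementations add details (such as end-of-word markers and byte-level fallback) that are omitted here:

```python
# Minimal sketch of BPE vocabulary learning: repeatedly merge the most
# frequent adjacent symbol pair. Toy corpus is an illustrative assumption.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, words):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency; each word starts as a sequence of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
words = {tuple(w): f for w, f in corpus.items()}

for step in range(10):  # learn 10 merges
    pairs = get_pair_counts(words)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    words = merge_pair(best, words)
    print(f"merge {step + 1}: {best[0]} + {best[1]} -> {best[0] + best[1]}")
```

Running this on the toy corpus shows frequent pairs like 'e' + 's' merging first, gradually building larger subword units out of characters.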
Subword tokenization elegantly solves the vocabulary problem. A word-level tokenizer would need an impossibly large vocabulary to handle every word, including misspellings, neologisms, technical terms, and compound words in languages like German. A character-level tokenizer would produce very long sequences, making processing expensive. Subword tokenization strikes a balance: the vocabulary is compact (typically 30,000-100,000 tokens), common words are kept whole, and rare words are composed from known subword pieces, ensuring the model can handle any input text.
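The composition of rare words from known pieces can be shown with a greedy longest-match segmenter, similar in spirit to how WordPiece applies a learned vocabulary. The tiny vocabulary and character fallback below are illustrative assumptions, not any real model's vocabulary:

```python
# Sketch of greedy longest-match subword segmentation: common words stay
# whole, rare words are composed from known pieces. VOCAB is a made-up
# assumption; real vocabularies hold tens of thousands of entries.
VOCAB = {"the", "token", "ization", "iz", "ation"}

def segment(word, vocab):
    """Greedily take the longest vocabulary entry matching at each position."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # unknown character: fall back to itself
            i += 1
    return pieces

print(segment("the", VOCAB))           # ['the'] -- common word stays whole
print(segment("tokenization", VOCAB))  # ['token', 'ization'] -- rare word split
```

Because every character can fall back to itself (or, in byte-level tokenizers, to a byte), no input is ever out of vocabulary.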
Tokenization has direct practical consequences. Model costs are often measured in tokens (not words), and the same text requires different numbers of tokens depending on the tokenizer; GPT-4's tokenizer differs from Claude's. Code, non-English languages, and mathematical notation often tokenize inefficiently, requiring more tokens per character. A model's context window (e.g., 128K tokens) is likewise denominated in tokens, so tokenization efficiency determines how much text fits in a single prompt. Understanding tokenization is therefore essential for optimizing both cost and performance when working with LLMs.
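Since costs and context limits are counted in tokens, measuring token counts before sending a prompt is a common practical step. A minimal sketch using the open-source tiktoken library (pip install tiktoken); the choice of the "cl100k_base" encoding is one option among tiktoken's built-ins, and other tokenizers will produce different counts for the same text:

```python
# Count tokens for a few strings, assuming the tiktoken library is installed.
# Exact counts depend on the encoding; different models use different tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["the quick brown fox", "def f(x): return x**2", "数学は楽しい"]:
    tokens = enc.encode(text)
    print(f"{len(text):3d} chars -> {len(tokens):2d} tokens: {text!r}")
```

The character-to-token ratio varies sharply across the three strings, which is exactly why code and non-English text consume context-window budget faster than plain English prose.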
Tokenization converts text into the subword units that language models actually process — it determines vocabulary coverage, processing efficiency, and cost, making it a practical concern for anyone using LLMs.