AI Tokenization
Process of breaking text into tokens that AI models use as their fundamental units of input and output.
Also known as: Tokenization, AI Tokens, Tokens
Category: AI
Tags: ai, machine-learning, technologies, fundamentals
Explanation
AI Tokenization is the process of converting raw text into numerical tokens that large language models can process. Because neural networks operate on numbers rather than characters or words, every piece of text must first be split into discrete units called tokens before it can enter the model.
**How Tokenization Works**
Modern tokenizers use subword algorithms that strike a balance between character-level and word-level representations:
- **Byte Pair Encoding (BPE)**: Iteratively merges the most frequent pairs of characters or subwords. Used by GPT models and many others.
- **SentencePiece**: A language-agnostic tokenizer that operates directly on raw text without pre-tokenization. Used by LLaMA and T5.
- **WordPiece**: Similar to BPE but uses a likelihood-based merging strategy. Used by BERT.
Common words typically map to a single token, while rare or complex words get split into multiple subword tokens. For English text, a rough average is about 4 characters per token, but this varies significantly for code, non-Latin scripts, and specialized terminology.
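The merge loop at the heart of BPE can be sketched in a few lines. This is a minimal toy implementation, not any production tokenizer: the corpus, word frequencies, and number of merge steps are all invented for illustration, and the naive string replace skips the bookkeeping real tokenizers do.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with its concatenation."""
    return {w.replace(" ".join(pair), "".join(pair)): f for w, f in words.items()}

# Toy corpus: words pre-split into characters, with made-up frequencies.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

for _ in range(3):  # three merge steps; real vocabularies take tens of thousands
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair, "->", "".join(pair))
```

Each iteration fuses the most frequent adjacent pair into a new subword symbol; frequent fragments like `est` quickly become single tokens while rare words stay split.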
**Why Tokenization Matters**
Token count directly determines cost, context window usage, and processing time. Every input and output token counts against the model's token budget, making efficient tokenization a first-order concern for both cost and capability.
Different models use different tokenizers, so the same text may produce different token counts across models. This means cost and context usage are not directly comparable between providers without accounting for tokenizer differences.
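The cost implication is simple arithmetic. The prices and token counts below are hypothetical placeholders, not real provider rates; the point is that the same prompt can yield different token counts, and therefore different costs, under different tokenizers.

```python
# Hypothetical per-million-token prices and token counts -- illustration only.
def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Cost in dollars given per-million-token input and output prices."""
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

prompt_tokens_model_a = 1_200  # assumed count under model A's tokenizer
prompt_tokens_model_b = 1_450  # assumed count for the same text under model B

cost_a = request_cost(prompt_tokens_model_a, 500, in_price=3.00, out_price=15.00)
cost_b = request_cost(prompt_tokens_model_b, 500, in_price=3.00, out_price=15.00)
print(f"A: ${cost_a:.4f}  B: ${cost_b:.4f}")
```

Identical text, identical prices, yet model B's request costs more purely because its tokenizer produces more tokens.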
**Structural Biases**
Tokenization has downstream consequences that are often overlooked. Rare words and non-Latin scripts are split into more tokens, which means they cost more, consume more context, and can receive lower-quality representations. This creates a structural bias in both cost and capability that disproportionately affects non-English languages and specialized domains.
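The script bias is easy to see at the byte level. Byte-level BPE tokenizers (used by GPT-style models) start from UTF-8 bytes, so non-Latin scripts begin with more symbols per character before any merges are applied:

```python
# UTF-8 encodes ASCII in one byte per character but most non-Latin
# scripts in two to four, so byte-level tokenizers start at a disadvantage.
english = "hello"
japanese = "こんにちは"  # also five characters

print(len(english), "chars,", len(english.encode("utf-8")), "bytes")
print(len(japanese), "chars,", len(japanese.encode("utf-8")), "bytes")
```

Five English characters occupy five bytes; five Japanese characters occupy fifteen. Merge tables trained mostly on English text compound this gap rather than close it.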
Each token is mapped to its own embedding vector, so tokenization choices directly shape the model's internal representations and, ultimately, its understanding of the input.
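The token-to-embedding step is a table lookup. The vocabulary, ids, and random vectors below are toy stand-ins for what a real model learns during training:

```python
import random

random.seed(0)

# Toy vocabulary mapping token strings to integer ids (assumed for illustration).
vocab = {"Hello": 0, ",": 1, " world": 2, "!": 3}
dim = 4  # embedding dimension; real models use hundreds or thousands

# Embedding table: one vector per token id; random stand-ins for learned weights.
embedding_table = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def embed(token_ids):
    """Look up one vector per token id -- the model never sees raw text."""
    return [embedding_table[i] for i in token_ids]

token_ids = [vocab[t] for t in ["Hello", ",", " world", "!"]]
vectors = embed(token_ids)
print(len(vectors), "tokens ->", len(vectors), "vectors of dim", dim)
```

A word split into three subword tokens gets three separate vectors, which is why over-fragmented text can end up with weaker representations than text the tokenizer handles in one piece.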