Tokenization
Breaking text into smaller units (tokens) that AI models can process.
Also known as: Token splitting, Text tokenization, Subword tokenization
Category: Concepts
Tags: ai, nlp, processing, fundamentals, languages
Explanation
Tokenization is the process of breaking text into smaller units, called tokens, that language models can process. Tokens aren't exactly words: they may be whole words, subwords, single characters, or punctuation. For example, 'Tokenization is important' might become ['Token', 'ization', 'is', 'important'].

Why it matters for users:
- Context limits are measured in tokens, not words.
- Token count affects API costs.
- Some languages tokenize less efficiently, requiring more tokens for the same content.

Common rules of thumb:
- 1 token ≈ 4 characters, or about 0.75 words, in English.
- Common words are single tokens; rare words split into subwords.
- Code often uses many tokens.

Why subword tokenization:
- It handles unknown words, since any string can always be broken into known subwords (in the worst case, individual characters).
- It balances vocabulary size against expressiveness.
- It enables multilingual models.

Practical implications:
- Long documents may exceed context limits.
- Token-dense content (code, technical text) costs more.
- Token boundaries can affect model behavior.

Tools exist to count tokens before sending text to an API. For knowledge workers, understanding tokenization helps estimate costs and feasibility, understand context limits, and optimize prompts for efficiency.
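The ideas above can be sketched in a few lines of Python: a toy greedy longest-match subword tokenizer (the tiny vocabulary is a made-up illustration, not any real model's vocabulary), plus a rough token-count estimator based on the 1 token ≈ 4 characters heuristic. Real tokenizers (e.g. BPE-based ones) are trained on large corpora and behave differently in detail; this is only a sketch of the splitting principle.

```python
# Toy greedy longest-match subword tokenizer. Illustrative vocabulary only.
VOCAB = {"Token", "ization", "is", "important"}

def tokenize(text, vocab):
    """Split each whitespace-separated word into the longest known subwords."""
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest possible piece starting at position i first.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No vocabulary match: fall back to a single character,
                # so unknown words can always be represented.
                tokens.append(word[i])
                i += 1
    return tokens

def estimate_tokens(text):
    """Rough English-text estimate using the ~4 characters per token heuristic."""
    return max(1, round(len(text) / 4))

print(tokenize("Tokenization is important", VOCAB))
# → ['Token', 'ization', 'is', 'important']
print(estimate_tokens("Tokenization is important"))
```

Note how 'Tokenization' is not in the vocabulary itself but still splits cleanly into the known pieces 'Token' and 'ization'; that fallback behavior is the core reason subword schemes never encounter truly out-of-vocabulary input.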