Token
A fundamental unit of text that language models process, typically representing a word, subword, or character.
Also known as: AI Token, LLM Token, Language Model Token
Category: AI
Tags: ai, nlp, fundamentals, tokens, languages
Explanation
A token is the basic unit of text that language models work with. Rather than processing raw characters or whole sentences, models break text into tokens — discrete chunks that serve as the atomic elements of language processing.
**What Tokens Look Like**:
Tokens don't always align with words:
- Common words are usually single tokens: "the", "cat", "running"
- Rare or compound words get split into multiple tokens: "tokenization" → ["token", "ization"]
- Punctuation and spaces are often separate tokens
- Numbers may be split digit by digit or in groups
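The splitting behavior above can be sketched with a toy greedy longest-match tokenizer. The vocabulary here is a tiny, hypothetical one chosen to reproduce the examples; real models learn vocabularies of tens of thousands of tokens with algorithms like byte-pair encoding (BPE), and their exact splits will differ.

```python
# Toy subword tokenizer: greedy longest-match against a tiny, hypothetical
# vocabulary. This only illustrates how words split into subwords; it is
# not how production tokenizers (e.g. BPE) are actually implemented.
TOY_VOCAB = {"the", "cat", "running", "token", "ization",
             "un", "happiness", " ", "."}

def tokenize(text: str, vocab=TOY_VOCAB) -> list[str]:
    tokens = []
    i = 0
    while i < len(text):
        # Find the longest vocabulary entry that matches at position i.
        match = None
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # fall back to a single character
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("tokenization"))  # → ['token', 'ization']
print(tokenize("the cat"))       # → ['the', ' ', 'cat']
```

The single-character fallback is what lets tokenizers handle any input, including novel words: anything not in the vocabulary still decomposes into known pieces.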
**Approximate Token Sizes**:
- 1 token ≈ 4 characters in English
- 1 token ≈ 0.75 words in English
- 100 tokens ≈ 75 words
- 1,000 tokens ≈ 750 words (about 1.5 pages)
These ratios vary by language — Chinese, Japanese, and Korean text often requires more tokens per equivalent meaning than English.
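The rules of thumb above translate directly into back-of-the-envelope estimators. These give rough English-language guides only; the model's own tokenizer is the only exact count.

```python
# Rough token estimators based on the English rules of thumb above
# (1 token ≈ 4 characters ≈ 0.75 words). Approximate by design --
# ratios vary by language and by content type.
def estimate_tokens_from_chars(text: str) -> int:
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(text: str) -> int:
    return max(1, round(len(text.split()) / 0.75))

# A 75-word passage estimates to about 100 tokens, matching the table above.
print(estimate_tokens_from_words("word " * 75))  # → 100
```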
**Why Tokens Matter**:
- **Context window limits**: Models can only process a fixed number of tokens at once (e.g., 200K tokens for Claude). Both your input and the model's output count toward this limit.
- **Cost**: API pricing is typically per token, so token-dense content (code, technical text) costs more than casual prose.
- **Speed**: More tokens mean longer generation times and higher latency.
- **Model vocabulary**: Each model has a fixed vocabulary of tokens (typically 30K–100K unique tokens) determined during training.
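Since pricing is per token, estimating cost is simple arithmetic over input and output token counts. The per-million-token prices below are hypothetical placeholders, not any provider's actual rates; check the current pricing page for real numbers.

```python
# Hedged sketch: estimating API cost from token counts.
# The prices are hypothetical -- substitute your provider's real rates.
INPUT_PRICE_PER_MTOK = 3.00    # hypothetical $ per million input tokens
OUTPUT_PRICE_PER_MTOK = 15.00  # hypothetical $ per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a request, given input and output token counts."""
    return (input_tokens * INPUT_PRICE_PER_MTOK
            + output_tokens * OUTPUT_PRICE_PER_MTOK) / 1_000_000

# A request with 200K input tokens and 10K output tokens:
print(estimate_cost(200_000, 10_000))  # → 0.75
```

Note the asymmetry: output tokens typically cost several times more than input tokens, which is one reason to cap generation length.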
**Token vs. Word vs. Character**:
| Unit | Example for "unhappiness" | Count |
|------|--------------------------|-------|
| Characters | u, n, h, a, p, p, i, n, e, s, s | 11 |
| Tokens | "un", "happiness" | 2 |
| Words | "unhappiness" | 1 |
Tokens strike a balance: they're more meaningful than characters (capturing morphological patterns) but more flexible than words (handling any text, including novel words).
**Practical Implications for Users**:
- Monitor token usage when working with AI APIs to manage costs
- Be aware that long prompts with extensive context consume tokens that could be used for output
- Code and structured data tend to be more token-dense than natural language
- Tools like tiktoken (OpenAI) or the Anthropic tokenizer let you count tokens before sending requests
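The points above can be combined into a simple pre-flight check: estimate whether a prompt plus the desired output fits within a model's context window. The 4-characters-per-token ratio and the 200,000-token window are illustrative assumptions from earlier in this entry; for exact counts, use the provider's own tokenizer.

```python
# Hedged pre-flight check: does a prompt plus the desired output budget
# fit in the context window? Uses the rough 4-chars-per-token English
# estimate; the 200K window is an illustrative figure, not universal.
CONTEXT_WINDOW = 200_000  # tokens; varies by model

def fits_in_context(prompt: str, max_output_tokens: int,
                    window: int = CONTEXT_WINDOW) -> bool:
    estimated_input = len(prompt) // 4  # rough English-only estimate
    return estimated_input + max_output_tokens <= window

print(fits_in_context("Summarize this article: ...", 1_000))  # → True
```

Because both input and output count against the same limit, reserving a generous output budget shrinks the room available for prompt context.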