Next-Token Prediction
The core mechanism of autoregressive language models, which generate text by repeatedly predicting a probability distribution over the next token given all preceding tokens.
Also known as: Next Word Prediction, Causal Language Modeling, Autoregressive Prediction
Category: AI
Tags: ai, machine-learning, fundamentals, generation, models
Explanation
Next-token prediction is the fundamental task that underpins modern large language models. The model is trained to answer one deceptively simple question: given a sequence of tokens, what token is most likely to come next?
**The Core Idea**:
Given the sequence: "The cat sat on the"
The model assigns probabilities to every token in its vocabulary:
- "mat" → 15%
- "floor" → 12%
- "bed" → 8%
- "roof" → 3%
- ... (thousands more with smaller probabilities)
This probability distribution is computed using the model's learned parameters, which encode patterns from its training data.
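In practice, the model outputs a raw score (logit) for each vocabulary token, and the softmax function turns those scores into the probability distribution above. A minimal sketch, using hypothetical logits for the "The cat sat on the" example (the specific tokens and scores are illustrative, not real model output):

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for the context "The cat sat on the"
logits = {"mat": 2.1, "floor": 1.9, "bed": 1.5, "roof": 0.5}
probs = softmax(logits)
# Higher logits map to higher probabilities; all probabilities sum to 1.
```

A real model produces one logit per token in a vocabulary of tens of thousands of entries; the mechanics are identical.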
**Why It Works So Well**:
Next-token prediction seems like a narrow task, but learning to do it well requires the model to develop rich internal representations of:
- **Grammar and syntax**: Understanding sentence structure to predict grammatically correct continuations
- **Semantics**: Understanding meaning to predict contextually appropriate words
- **World knowledge**: Knowing facts about the world to predict accurate statements
- **Reasoning**: Following logical chains to predict correct conclusions
- **Style and tone**: Matching the register and voice of the preceding text
As Ilya Sutskever noted, predicting the next token well enough requires understanding the processes that generated the text — effectively requiring the model to build a world model.
**From Prediction to Generation**:
Text generation is simply repeated next-token prediction:
1. Process the prompt
2. Predict the probability of each possible next token
3. Sample a token from this distribution
4. Add it to the sequence
5. Repeat from step 2 with the extended sequence
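The five steps above can be sketched as a loop. Here `model` is a stand-in (an assumption, not a real API) for any function that maps a token sequence to a `{token: probability}` distribution:

```python
import random

def generate(model, prompt_tokens, max_new_tokens=20):
    """Autoregressive generation: repeatedly sample the next token.

    `model(tokens)` is assumed to return a dict mapping each candidate
    next token to its probability.
    """
    tokens = list(prompt_tokens)                 # step 1: start from the prompt
    for _ in range(max_new_tokens):
        probs = model(tokens)                    # step 2: distribution over next token
        choices, weights = zip(*probs.items())
        next_tok = random.choices(choices, weights=weights, k=1)[0]  # step 3: sample
        tokens.append(next_tok)                  # step 4: extend the sequence
    return tokens                                # step 5: the loop repeats until done
```

Sampling (rather than always taking the top token) is what makes generation non-deterministic; greedy decoding is the special case of always picking the highest-probability token.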
**Training Objective**:
During pre-training, the model learns by:
1. Reading a sequence of tokens from training data
2. Predicting the next token at each position
3. Computing the loss — typically cross-entropy, which penalizes the model for assigning low probability to the actual next token
4. Adjusting parameters via backpropagation to improve predictions
This is called **causal language modeling** — the model only sees tokens that come before the current position (no peeking ahead).
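The causal objective can be sketched as follows. This computes the average cross-entropy for one sequence, where `predict` is a placeholder (an assumption for illustration) for a model that returns a next-token distribution conditioned only on the prefix:

```python
import math

def causal_lm_loss(predict, tokens):
    """Average cross-entropy loss over a token sequence.

    `predict(prefix)` is assumed to return a {token: probability} dict for
    the next token, conditioned only on the prefix (no peeking ahead).
    """
    total = 0.0
    for i in range(1, len(tokens)):
        probs = predict(tokens[:i])          # only tokens BEFORE position i are visible
        p = probs.get(tokens[i], 1e-12)      # probability assigned to the true next token
        total += -math.log(p)                # perfect prediction (p=1) contributes 0 loss
    return total / (len(tokens) - 1)
```

A model that always assigns probability 1 to the correct token achieves zero loss; assigning uniform probability over two candidates gives a loss of ln 2 per position. In a transformer, the "no peeking" constraint is enforced with a causal attention mask rather than by truncating the input at each position.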
**Limitations**:
- Next-token prediction optimizes for statistical plausibility, not truth — leading to hallucinations
- The model has no inherent notion of correctness, only likelihood
- Performance depends entirely on the quality and breadth of training data
- The autoregressive nature means errors can compound (each token conditions all future tokens)