Speculative Decoding
An inference acceleration technique where a smaller draft model proposes multiple tokens that a larger target model verifies in parallel, speeding up generation without changing output quality.
Also known as: Speculative Sampling, Draft-and-Verify Decoding, Assisted Generation
Category: AI
Tags: ai, machine-learning, optimization, performance, models
Explanation
Speculative Decoding is an inference optimization technique for autoregressive language models that achieves significant speedups without sacrificing output quality. It exploits the asymmetry between generating and verifying tokens: while generating each token requires a full forward pass, verifying multiple proposed tokens can be done in a single forward pass.
**How It Works**:
1. **Draft Phase**: A small, fast draft model generates K candidate tokens autoregressively (cheap, quick forward passes)
2. **Verification Phase**: The large target model processes all K draft tokens in a single forward pass (parallel verification)
3. **Accept/Reject**: Each draft token is accepted if it matches what the target model would have generated. At the first mismatch, the rejected token is replaced by one sampled from the target model's (suitably adjusted) distribution
4. **Repeat**: The process continues from the last accepted position
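The four steps above can be sketched in Python for the greedy case, where "accept" simply means the draft token equals the target model's prediction. The `draft_model` and `target_model` callables and the list-based token interface are hypothetical stand-ins for illustration; in a real system the verification loop below is a single batched forward pass.

```python
def speculative_decode(draft_model, target_model, prompt, k=4, max_tokens=10):
    """Greedy speculative decoding sketch.

    `draft_model(seq)` / `target_model(seq)` are hypothetical callables
    returning the next token for a token sequence; real systems score all
    k draft positions in one batched target forward pass.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_tokens:
        # 1. Draft phase: the small model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_model(seq + draft))
        # 2. Verification phase: target predictions at every draft position
        #    (conceptually a single parallel forward pass).
        targets = [target_model(seq + draft[:i]) for i in range(k)]
        # 3. Accept/reject: keep draft tokens while they match the target;
        #    on the first mismatch, emit the target's token instead.
        n = 0
        while n < k and draft[n] == targets[n]:
            n += 1
        seq += draft[:n]
        # If every draft token was accepted, the target pass yields one
        # extra "bonus" token for free.
        seq.append(targets[n] if n < k else target_model(seq))
        # 4. Repeat from the last accepted position.
    return seq
```

For example, with toy integer-token "models" where the target always predicts `seq[-1] + 1`, the loop reproduces the target-only output exactly even when the draft model is frequently wrong.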
**Why It Works**:
- Many tokens in a sequence are 'easy' (predictable), and the draft model gets them right
- The target model verifies K tokens in roughly the same time as generating 1 token
- On average, several tokens are accepted per verification step, typically yielding a 2-3x speedup
- The output distribution is mathematically identical to that of the target model alone, so there is no quality loss
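The lossless guarantee comes from a rejection-sampling rule: a draft token x proposed from the draft distribution q is accepted with probability min(1, p(x)/q(x)) under the target distribution p, and on rejection a token is drawn from the normalized residual max(0, p - q). A minimal sketch over a small explicit vocabulary (the distributions used are illustrative, not from any real model):

```python
import random

def speculative_sample(p, q, rng=random):
    """One step of the lossless accept/reject rule over a small vocabulary.

    p: target distribution (list of probabilities), q: draft distribution.
    Returns a token distributed exactly according to p.
    """
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]        # draft proposes x ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    return rng.choices(vocab, weights=[r / total for r in residual])[0]
```

Sampling this function many times recovers p regardless of how different q is, which is exactly why speculative decoding changes speed but not output quality.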
**Key Properties**:
- **Lossless**: The output distribution is exactly the same as standard decoding from the target model
- **Speedup varies**: More predictable text (code, formulaic writing) sees larger gains; creative text sees smaller gains
- **Draft model choice**: The draft model must be much faster than the target but similar enough to have high acceptance rates
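Under the simplifying assumption that each draft token is accepted independently with probability alpha (the acceptance rate), the expected number of tokens emitted per target forward pass with k draft tokens is (1 - alpha^(k+1)) / (1 - alpha), counting the corrected or bonus token emitted every step. A one-line helper makes the trade-off concrete; the numbers below are illustrative:

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens per target forward pass with k draft tokens,
    assuming i.i.d. acceptance with rate alpha < 1 and counting the
    corrected/bonus token appended at the end of every step."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# e.g. alpha = 0.8, k = 4 gives about 3.36 tokens per target pass;
# once draft-model overhead is subtracted, this is consistent with
# the commonly quoted 2-3x end-to-end speedups.
```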
**Variants**:
- **Self-Speculative Decoding**: Using earlier layers of the same model as the draft (no separate model needed)
- **Medusa**: Adding multiple prediction heads to generate draft tokens without a separate model
- **Lookahead Decoding**: Using Jacobi iteration to parallelize token generation
- **Eagle**: Combining feature-level draft generation with token verification
**Practical Impact**:
Speculative decoding has become a key technique for production LLM serving, reducing the cost and latency of generating long outputs. It is particularly valuable for code generation, long-form writing, and other tasks where the output is often predictable. Major AI providers use variants of speculative decoding in their inference infrastructure.
**Limitations**:
- Requires a compatible draft model or architectural modification
- Memory overhead of loading two models simultaneously
- Speedup depends on text predictability and acceptance rate
- Batch serving can complicate implementation