AI Speculative Decoding
Technique where a smaller draft model generates candidate tokens that a larger model verifies in parallel to speed up inference.
Also known as: Speculative Decoding, Draft-Verify Decoding
Category: AI
Tags: ai, machine-learning, performance, optimization
## Explanation
Speculative decoding is an inference optimization technique that uses a small, fast "draft" model to propose candidate tokens, which a larger "target" model then verifies in parallel. Verifying several tokens costs roughly one forward pass of the target model, whereas generating them one by one would cost one pass per token, so this approach significantly speeds up inference without changing output quality.
## How It Works
The process follows a simple loop:
1. The draft model generates a sequence of candidate tokens (typically 3-8 tokens ahead)
2. The target model evaluates all candidate tokens in a single forward pass
3. The target model accepts or rejects each draft token by comparing probability distributions
4. If a token is rejected, it and all subsequent draft tokens are discarded, and a corrected token is sampled from an adjusted target distribution; if every draft token is accepted, the same forward pass supplies one extra token from the target model for free
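The loop above can be sketched in a few lines. This is a toy illustration, not a real serving implementation: the two `*_probs` functions are made-up stand-ins for the draft and target models, and the vocabulary has four tokens.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 4  # toy vocabulary size

def draft_probs(context):
    # Stand-in for the small draft model: a fixed next-token distribution.
    return np.array([0.5, 0.3, 0.1, 0.1])

def target_probs(context):
    # Stand-in for the large target model.
    return np.array([0.4, 0.4, 0.1, 0.1])

def speculative_step(context, k=4):
    """One draft-then-verify iteration; returns the tokens emitted."""
    # 1. Draft model proposes k candidate tokens sequentially.
    drafts, qs = [], []
    ctx = list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        t = int(rng.choice(VOCAB, p=q))
        drafts.append(t)
        qs.append(q)
        ctx.append(t)
    # 2. Target model scores every prefix (a single batched pass in practice).
    ps = [target_probs(list(context) + drafts[:i]) for i in range(k + 1)]
    # 3. Accept each draft token with probability min(1, p/q).
    out = []
    for i, t in enumerate(drafts):
        if rng.random() < min(1.0, ps[i][t] / qs[i][t]):
            out.append(t)
        else:
            # 4. On rejection, discard the rest and resample from the
            #    residual target distribution max(p - q, 0), renormalized.
            residual = np.maximum(ps[i] - qs[i], 0)
            residual /= residual.sum()
            out.append(int(rng.choice(VOCAB, p=residual)))
            return out
    # All drafts accepted: the extra target distribution yields a bonus token.
    out.append(int(rng.choice(VOCAB, p=ps[k])))
    return out

print(speculative_step(context=[0], k=4))
```

Each call emits between 1 token (first draft rejected, corrected token sampled) and k + 1 tokens (all drafts accepted plus the bonus token), while the target model ran only once.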
Because the target model makes the final decision on every token, speculative decoding is **lossless** -- it produces exactly the same output distribution as running the target model alone. It is a pure speed optimization with no quality trade-off.
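The losslessness guarantee rests on the single-token acceptance rule: sample a token from the draft distribution q, accept it with probability min(1, p/q), and otherwise resample from the normalized residual max(p - q, 0). The result is distributed exactly according to the target distribution p, which a Monte Carlo check over made-up distributions illustrates:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.6, 0.2, 0.1, 0.1])      # target distribution (made up)
q = np.array([0.25, 0.25, 0.25, 0.25])  # draft distribution (made up)

# Residual distribution used when the draft token is rejected.
residual = np.maximum(p - q, 0)
residual /= residual.sum()

N = 200_000
drafts = rng.choice(4, size=N, p=q)                            # draft proposes
accept = rng.random(N) < np.minimum(1.0, p[drafts] / q[drafts])
resampled = rng.choice(4, size=N, p=residual)                  # correction on reject
final = np.where(accept, drafts, resampled)

freq = np.bincount(final, minlength=4) / N
print(freq)  # ≈ [0.6, 0.2, 0.1, 0.1], matching the target p
```

The empirical frequencies converge to p regardless of how poor the draft distribution q is; q only affects how often the cheap accept path is taken, never the output distribution.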
## Performance
Typical speedups range from 2-3x for well-matched draft/target pairs. The actual improvement depends on the **acceptance rate**: if the draft model's predictions closely match the target model, more tokens are accepted per verification step, yielding greater speedup. Tasks where the output is more predictable (e.g., code completion, structured outputs) tend to see higher acceptance rates.
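Under the simplifying assumption used in the standard analysis of speculative decoding, where each draft token is accepted independently with probability alpha, the expected number of tokens emitted per target forward pass with draft length gamma is (1 - alpha^(gamma + 1)) / (1 - alpha). A quick calculation shows how strongly the acceptance rate drives the gain:

```python
def expected_tokens(alpha, gamma):
    # Expected tokens emitted per target-model pass, assuming each draft
    # token is accepted independently with probability alpha.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.6, 0.8, 0.9):
    print(alpha, round(expected_tokens(alpha, gamma=4), 2))
# → 2.31, 3.36, and 4.1 tokens per pass
```

Raising the acceptance rate from 0.6 to 0.9 nearly doubles the tokens recovered per verification step, which is why well-matched draft/target pairs and predictable tasks see the largest speedups.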
## Variants
Several variants of the technique have emerged:
- **Self-speculative decoding**: Uses earlier layers of the same model as the draft, eliminating the need for a separate smaller model
- **Medusa**: Adds extra prediction heads to the target model itself, allowing it to predict multiple future tokens simultaneously
- **EAGLE**: Uses a lightweight draft head trained on the target model's hidden states, yielding better acceptance rates than a generic standalone draft model
## Why It Matters
As large language models grow in size, inference latency and cost become critical bottlenecks. Speculative decoding addresses this by making better use of available hardware parallelism, reducing the time users wait for responses without sacrificing the quality that larger models provide. It is increasingly becoming a standard optimization in production LLM serving systems.