AI KV Cache
Key-value caching mechanism that stores previously computed attention states to speed up sequential token generation.
Also known as: KV Cache, Key-Value Cache, Attention Cache
Category: AI
Tags: ai, machine-learning, performance, optimization
Explanation
The KV (Key-Value) Cache is an optimization technique used in transformer-based language models that stores previously computed key and value tensors from the attention mechanism during autoregressive text generation. Without it, the model would need to recompute attention over all previous tokens for every new token generated, making generation prohibitively slow.
**How It Works**
During autoregressive generation, a transformer model produces one token at a time. Each new token needs to attend to all previous tokens via the attention mechanism, which requires key and value vectors for every prior position. The KV cache stores these vectors as they are computed, so they only need to be calculated once.
This trades memory for speed: with the cache, generating each new token costs O(n) attention work instead of O(n^2), where n is the current sequence length. The improvement is dramatic for long sequences.
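The mechanism above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a real model: the projection matrices `Wq`, `Wk`, `Wv`, the head dimension, and the random "hidden states" are all placeholders. The point is that each token's key and value are computed once, appended to the cache, and reused at every later step, while only the new token's query is computed.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention: one query against all cached positions.
    scores = q @ K.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()          # softmax over cached positions
    return weights @ V                # weighted sum of cached values

rng = np.random.default_rng(0)
d = 8                                 # illustrative head dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []             # grows by one entry per generated token
outputs = []
for step in range(5):
    x = rng.standard_normal(d)        # stand-in for the new token's hidden state
    K_cache.append(x @ Wk)            # key/value computed once, then reused forever
    V_cache.append(x @ Wv)
    q = x @ Wq                        # only the NEW token's query is needed
    outputs.append(attention(q, np.stack(K_cache), np.stack(V_cache)))
```

Without the cache, every iteration would recompute `x @ Wk` and `x @ Wv` for all prior positions; with it, each step does O(n) work against already-stored tensors.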
**The Memory Challenge**
The KV cache grows linearly with sequence length and batch size, and its size is also proportional to the number of attention layers, the number of key-value heads, the per-head dimension, and the bytes per element. For long contexts, this can consume tens of gigabytes of GPU memory. A model might support 128K tokens in theory, but the KV cache memory required to serve that context at reasonable batch sizes can be prohibitive.
KV cache size is often the primary constraint on long-context inference, not the model weights themselves.
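The scale of the problem is easy to estimate with back-of-envelope arithmetic. The sketch below uses hypothetical model dimensions (80 layers, 64 full key-value heads, head dimension 128, FP16) rather than any specific model:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch=1):
    # Factor of 2: one tensor each for keys and values at every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Hypothetical large model with full multi-head attention at a 128K context:
full = kv_cache_bytes(num_layers=80, num_kv_heads=64, head_dim=128,
                      seq_len=128 * 1024)
print(f"{full / 2**30:.0f} GiB")  # prints "320 GiB" for a single sequence
```

At 320 GiB for one sequence, the cache dwarfs the weights of most models, which is exactly why the optimizations below exist.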
**Optimization Techniques**
- **Paged Attention** (used by vLLM): Treats the KV cache like virtual memory, eliminating fragmentation and enabling more efficient memory utilization.
- **Sliding Window Attention**: Only caches the most recent N tokens, bounding memory usage at the cost of losing access to distant context.
- **KV Cache Quantization**: Stores keys and values in lower precision (e.g., FP8 instead of FP16) to reduce memory footprint.
- **Multi-Query / Grouped-Query Attention (MQA/GQA)**: Shares key-value heads across multiple query heads, significantly reducing the cache size per layer.
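The savings from three of these techniques can be compared with a simple sizing formula. All model dimensions here are illustrative, and the formula is a first-order approximation that ignores implementation overhead:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys plus values, per layer, per position, single sequence.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

L, H, D, N = 80, 64, 128, 128 * 1024             # hypothetical model and context

baseline = kv_cache_bytes(L, H, D, N)             # full MHA, FP16
gqa      = kv_cache_bytes(L, 8, D, N)             # GQA: 8 KV heads serve 64 query heads
window   = kv_cache_bytes(L, H, D, 4096)          # sliding window: keep last 4096 tokens
fp8      = kv_cache_bytes(L, H, D, N, bytes_per_elem=1)  # FP8 instead of FP16

print(baseline // gqa)     # 8x smaller with grouped-query attention
print(baseline // window)  # 32x smaller with a 4096-token window
print(baseline // fp8)     # 2x smaller with FP8 quantization
```

Note that the techniques compose: a GQA model serving quantized caches gets both factors, which is why modern serving stacks typically combine several of them.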
The KV cache is directly relevant to context window management, context compression strategies, and the practical economics of serving large language models at scale.