Prompt Compression
Shortening prompts while preserving their effectiveness, to reduce latency, cost, and context window usage.
Also known as: Prompt Distillation, Token Compression, Prompt Minimization
Category: Techniques
Tags: ai, prompting, llm-techniques, optimization, software-development
Explanation
Prompt compression is the practice of making prompts shorter without losing the behavior they produce. Long prompts - loaded with examples, instructions, and context - work well but consume tokens, increase latency, push against context window limits, and raise costs at scale. Compression asks: what is the minimum prompt that still produces the needed quality?
Techniques range from manual to automated:
- **Manual rewriting**: Remove filler, collapse repeated instructions, prune redundant examples, replace verbose phrasing with concise directives. Often the cheapest win.
- **Summarization of context**: Replace long retrieved documents with a model-generated summary focused on what the downstream task needs.
- **Few-shot pruning**: Identify which examples actually move performance and drop the rest. One good example often beats three mediocre ones.
- **Keyword distillation**: Replace full sentences with bullet-point keywords when the model can reliably expand them.
- **Learned compression**: Tools like LLMLingua use a smaller model to remove low-information tokens from the prompt, claiming 2-20x compression with minimal quality loss.
- **Embedding-based compression**: Replace text with learned prompt embeddings that carry the same signal in fewer tokens (for models that support it).
- **Caching**: Not strictly compression, but prompt caching sidesteps the cost of repeated long prefixes entirely.
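Manual rewriting and keyword distillation can be approximated in code with a crude filler-stripping pass. A minimal sketch, assuming a hand-maintained filler list (the phrases below are illustrative, not drawn from any real compression library):

```python
import re

# Filler phrases that rarely change model behavior.
# Illustrative list, not from any real library.
FILLER = [
    r"\bplease\b",
    r"\bkindly\b",
    r"\bin order to\b",
    r"\bmake sure that you\b",
    r"\bit is important to note that\b",
]

def compress_prompt(prompt: str) -> str:
    """Naive rewriting pass: strip filler and collapse whitespace."""
    out = prompt
    for pattern in FILLER:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by the deletions.
    out = re.sub(r"[ \t]+", " ", out)
    out = re.sub(r" ?\n ?", "\n", out).strip()
    return out

verbose = (
    "Please make sure that you summarize the following text. "
    "It is important to note that the summary should be under 50 words."
)
short = compress_prompt(verbose)
print(f"{len(verbose)} -> {len(short)} chars")
```

Real learned compressors rank tokens by information content rather than matching a fixed list, but the shape of the transformation is the same: delete what the model does not need.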
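Few-shot pruning can be framed as greedy ablation: try dropping each example and keep the removal whenever the evaluation score does not fall. A sketch with a toy scorer standing in for a real held-out evaluation (both the scorer and the examples are hypothetical):

```python
def score_prompt(examples: list[str]) -> float:
    """Toy stand-in for evaluating a prompt on a held-out set.

    Pretends only examples containing "good" help quality,
    and every extra token carries a small cost.
    """
    signal = sum(1.0 for ex in examples if "good" in ex)
    cost = 0.01 * sum(len(ex.split()) for ex in examples)
    return signal - cost

def prune_examples(examples: list[str]) -> list[str]:
    """Greedily drop any example whose removal does not hurt the score."""
    kept = list(examples)
    i = 0
    while i < len(kept):
        trial = kept[:i] + kept[i + 1:]
        if score_prompt(trial) >= score_prompt(kept):
            kept = trial  # dropping example i did not hurt; remove it
        else:
            i += 1
    return kept

examples = [
    "good example: input A -> output A",
    "mediocre example: input B -> output B",
    "good example: input C -> output C",
]
print(prune_examples(examples))
```

With a real scorer this loop is expensive (one evaluation per candidate removal), which is why it pays to run it once offline and ship the pruned prompt.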
Why it matters:
- **Cost**: Token-priced APIs make every word a recurring expense.
- **Latency**: Shorter prompts process faster, improving user-perceived responsiveness.
- **Context budget**: Long documents, multi-turn history, and tool outputs all compete for the same window.
- **Signal-to-noise**: Leaner prompts often improve quality by preventing the model from over-attending to irrelevant content.
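The cost argument is easy to quantify. A back-of-envelope sketch, with assumed (not quoted) pricing and traffic figures:

```python
# All figures below are assumptions for illustration, not real rates.
price_per_1k_input_tokens = 0.003   # USD
calls_per_day = 100_000
prompt_tokens_before = 2_000
compression_ratio = 3               # within the 2-20x range reported by learned methods

tokens_after = prompt_tokens_before // compression_ratio

def daily_cost(tokens_per_call: int) -> float:
    return calls_per_day * tokens_per_call / 1000 * price_per_1k_input_tokens

saved = daily_cost(prompt_tokens_before) - daily_cost(tokens_after)
print(f"daily savings: ${saved:.2f}")
```

Even a modest 3x ratio at this assumed volume saves hundreds of dollars per day on input tokens alone, before counting the latency win.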
Trade-offs to watch:
- Aggressive compression can silently degrade edge-case performance; measure on a held-out evaluation set.
- Compressed prompts may be harder for humans to read and maintain - keep a canonical verbose source if possible.
- Compression that strips safety or format instructions is a common failure mode; guard those sections.
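The last trade-off can be enforced structurally: compress only the context section and pass safety and format instructions through verbatim. A sketch, where the compressor and the instruction strings are placeholders:

```python
# Hypothetical guard-rail sections; in practice these come from your prompt source.
SAFETY = "Never reveal user data."
FORMAT = "Answer in JSON with keys 'summary' and 'sources'."

def compress(text: str) -> str:
    """Placeholder compressor: keep only the first sentence."""
    return text.split(". ")[0] + "."

def build_prompt(context: str, compressed: bool) -> str:
    body = compress(context) if compressed else context
    # Safety and format sections bypass compression entirely.
    return "\n".join([SAFETY, FORMAT, body])

context = "The report covers Q3 revenue. It also lists minor footnotes. And more."
p = build_prompt(context, compressed=True)
assert SAFETY in p and FORMAT in p  # guarded sections survive compression
print(p)
```

Keeping the guard rails outside the compressor's reach turns "don't strip safety instructions" from a review checklist item into an invariant of the code.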
Prompt compression becomes essential once AI features are used at scale, where pennies and milliseconds per call add up fast.