AI Distillation
Training a smaller student model to replicate the behavior of a larger teacher model while retaining most of its performance.
Also known as: Knowledge Distillation, Model Distillation
Category: AI
Tags: ai, machine-learning, optimization, models
## Explanation
Knowledge distillation is the process of transferring knowledge from a large, complex model (the "teacher") to a smaller, more efficient model (the "student"). The student learns to approximate the teacher's behavior, gaining much of its capability at a fraction of the computational cost. The technique was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015.
## How it works
Instead of training the student model on hard labels (correct/incorrect), distillation uses the teacher's **soft outputs** (probability distributions across all possible answers). These soft labels contain richer information: they encode the teacher's uncertainty and the relationships between classes. A cat image might have a high probability for "cat" but also small probabilities for "dog" and "tiger," revealing learned similarity structure.
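As a minimal sketch of soft labels, assuming a hypothetical four-class classifier (the class names and logit values below are illustrative, not from any real model), a temperature-scaled softmax shows how raising the temperature flattens the distribution and exposes the teacher's learned similarity structure:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution; higher temperature -> softer."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for classes [cat, dog, tiger, car]
logits = [8.0, 3.0, 4.0, -2.0]

hard = softmax(logits, temperature=1.0)  # sharply peaked on "cat"
soft = softmax(logits, temperature=4.0)  # "tiger" and "dog" become visible
```

At temperature 1 the distribution is dominated by "cat"; at temperature 4 the residual mass on "tiger" and "dog" (and near-zero mass on "car") becomes large enough for a student to learn from.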
The process typically involves:
1. Train a large teacher model to high performance
2. Generate soft labels by running the training data through the teacher
3. Train the student to match both the hard labels and the teacher's soft output distribution
A temperature parameter controls how soft the distributions are: higher temperatures flatten the probabilities, exposing more of the teacher's similarity structure.
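The combined objective in step 3 can be sketched as a weighted sum of two cross-entropy terms, following the loss in Hinton et al. (2015); the logit values, temperature, and weighting below are illustrative assumptions, not prescribed settings:

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, hard_label, T=4.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and soft-target cross-entropy.
    The soft term is scaled by T^2 so its gradient magnitude stays comparable
    as the temperature changes (as suggested by Hinton et al., 2015)."""
    # Hard term: ordinary cross-entropy against the ground-truth label.
    student_hard = softmax(student_logits, T=1.0)
    hard_loss = -math.log(student_hard[hard_label])
    # Soft term: cross-entropy between teacher and student soft distributions.
    teacher_soft = softmax(teacher_logits, T=T)
    student_soft = softmax(student_logits, T=T)
    soft_loss = -sum(p * math.log(q) for p, q in zip(teacher_soft, student_soft))
    return alpha * hard_loss + (1 - alpha) * (T * T) * soft_loss

# A student that matches the teacher incurs a lower loss than one that disagrees.
teacher = [8.0, 3.0, 4.0, -2.0]
loss_matched = distillation_loss(teacher, teacher, hard_label=0)
loss_mismatched = distillation_loss([0.0, 5.0, 0.0, 0.0], teacher, hard_label=0)
```

In practice both terms are computed on minibatches with an autodiff framework; this scalar version only shows the shape of the objective.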
## Why it matters
Distillation enables deploying powerful capabilities on constrained hardware. A massive model runs in the cloud to generate training data, and a small model learns from that data to run locally. This is how many lightweight models achieve disproportionate performance relative to their size.
Practical applications include:
- **Edge deployment**: running capable models on phones, embedded devices, and browsers
- **Cost reduction**: serving cheaper, faster models in production while maintaining quality
- **Latency optimization**: smaller models respond faster for real-time applications
- **Specialization**: distilling a general model into a task-specific expert
## Types of distillation
- **Response-based**: student learns from teacher's output predictions
- **Feature-based**: student learns from teacher's intermediate representations
- **Relation-based**: student learns the relationships between different layers or data points
- **Progressive distillation**: gradually reducing model size through multiple distillation steps
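The feature-based variant above can be sketched as a distance between intermediate activations. This assumes the student and teacher features have already been brought to the same width; in practice a small learned projection layer handles the dimension mismatch, which this illustrative snippet simply rejects:

```python
def feature_distillation_loss(student_feats, teacher_feats):
    """Mean squared error between intermediate activations (feature-based
    distillation). Assumes both feature vectors share the same dimensionality;
    real implementations insert a projection layer when widths differ."""
    if len(student_feats) != len(teacher_feats):
        raise ValueError("feature dimensions must match (use a projection layer)")
    n = len(student_feats)
    return sum((s - t) ** 2 for s, t in zip(student_feats, teacher_feats)) / n

# Hypothetical activations from one matched layer of teacher and student.
teacher_layer = [0.5, -1.2, 3.0]
perfect = feature_distillation_loss(teacher_layer, teacher_layer)   # zero loss
imperfect = feature_distillation_loss([0.0, 0.0, 0.0], teacher_layer)
```

This term is typically added to the response-based loss rather than replacing it, so the student matches the teacher's internal representations as well as its outputs.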
## Trade-offs
Distillation inherently loses nuance. A compressed model cannot do everything the original could. The student may struggle with edge cases the teacher handled well, and the degree of compression determines the performance gap. Careful evaluation is needed to ensure the distilled model meets quality thresholds for its intended use case.