Knowledge Distillation
A model compression technique where a smaller student model is trained to reproduce the behavior and outputs of a larger, more capable teacher model.
Also known as: Model Distillation, Teacher-Student Learning, KD
Category: AI
Tags: ai, machine-learning, optimization, models, training
Explanation
Knowledge Distillation is a machine learning technique popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015, in which knowledge from a large, complex model (the teacher) is transferred to a smaller, more efficient model (the student). The student learns not only from the hard labels in the training data but also from the teacher's soft probability distributions, which encode richer information about inter-class relationships.
**How It Works**:
1. **Train the teacher**: A large, high-accuracy model is trained normally on the task
2. **Generate soft targets**: The teacher produces probability distributions over outputs (not just the top prediction) at a raised temperature
3. **Train the student**: The smaller model learns from both the original labels and the teacher's soft distributions
4. **Deploy the student**: The compact model is used in production, offering most of the teacher's accuracy at a fraction of the cost
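The four steps above hinge on the loss used in step 3: a weighted sum of a soft term (cross-entropy against the teacher's temperature-scaled distribution) and a hard term (cross-entropy against the true label). A minimal sketch in plain Python, with illustrative values for the temperature and the `alpha` weighting:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher T flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature=4.0, alpha=0.5):
    """Combine soft-target and hard-label cross-entropy.

    The T**2 factor rescales the soft term so its gradients stay
    comparable in magnitude as the temperature changes (Hinton et al., 2015).
    """
    eps = 1e-12
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # Cross-entropy between the teacher's and student's soft distributions
    soft_loss = -sum(p * math.log(q + eps)
                     for p, q in zip(soft_teacher, soft_student))
    # Standard cross-entropy against the one-hot ground-truth label
    hard_probs = softmax(student_logits)
    hard_loss = -math.log(hard_probs[true_index] + eps)
    return alpha * (temperature ** 2) * soft_loss + (1 - alpha) * hard_loss
```

In a real training loop this loss would be minimized by gradient descent over the student's parameters; the teacher's logits are computed once per batch and treated as constants.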
**Why Soft Targets Matter**:
When a teacher model classifies an image of a dog, it might output: dog=0.85, wolf=0.10, cat=0.04, horse=0.01. These soft probabilities contain far more information than the hard label 'dog' alone: they reveal that dogs look somewhat like wolves, a little like cats, and not much like horses. The student learns these relationships, not just the correct answer.
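The role of temperature is easy to see directly: dividing the logits by a larger T before the softmax flattens the distribution, exposing more of this inter-class structure. The logits below are made up to roughly reproduce the dog example:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical teacher logits for the classes [dog, wolf, cat, horse]
logits = [5.0, 2.9, 2.0, 0.6]

print([round(p, 3) for p in softmax(logits, temperature=1.0)])
# → [0.844, 0.103, 0.042, 0.01]   (sharp: nearly all mass on 'dog')
print([round(p, 3) for p in softmax(logits, temperature=4.0)])
# → [0.417, 0.247, 0.197, 0.139]  (soft: inter-class structure visible)
```

At T=1 the wolf/cat/horse signal is almost invisible; at T=4 the student receives a much richer target to match.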
**Types of Distillation**:
- **Response-based**: Student mimics teacher's output probabilities
- **Feature-based**: Student mimics teacher's intermediate representations
- **Relation-based**: Student mimics relationships between teacher's representations across samples
- **Self-distillation**: A model distills knowledge from its own deeper layers to shallower ones
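To make the less obvious variants concrete, here is a minimal sketch of a relation-based loss: instead of matching outputs or features directly, the student matches the teacher's pairwise similarity structure across a batch. The feature vectors and the choice of cosine similarity with an MSE penalty are illustrative, not a specific published method:

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def relation_matrix(feats):
    """Pairwise cosine similarities between all samples in a batch."""
    return [[cosine(a, b) for b in feats] for a in feats]

def relation_loss(teacher_feats, student_feats):
    """Mean squared difference between teacher and student similarity matrices."""
    T = relation_matrix(teacher_feats)
    S = relation_matrix(student_feats)
    n = len(T)
    return sum((T[i][j] - S[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

A response-based loss compares the two models sample by sample; this loss instead asks that samples the teacher considers similar also look similar to the student, regardless of the absolute feature values.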
**Applications**:
- **Model compression**: Deploy capable models on mobile devices or edge hardware
- **LLM distillation**: Creating smaller language models that retain much of a larger model's capability
- **Ensemble distillation**: Compressing an ensemble of models into a single model
- **Cross-modal distillation**: Transferring knowledge between different data modalities
**Distillation in the LLM Era**:
Many smaller open-source language models are trained with distillation from larger ones, often by fine-tuning the student on text generated by the teacher (sequence-level distillation) rather than by matching logits directly. The student might be 10-100x smaller while retaining much of the teacher's performance on key benchmarks. This has democratized access to capable AI models.
**Limitations**:
- The student typically falls short of the teacher's performance, though the gap can be small
- Requires access to teacher model outputs (or the model itself)
- Training is more complex than standard training
- Quality depends heavily on the teacher-student architecture match