Model Pruning
A neural network compression technique that removes redundant or low-impact weights, neurons, or entire layers to create smaller, faster models.
Also known as: Neural Network Pruning, Weight Pruning, Network Pruning
Category: AI
Tags: ai, machine-learning, optimization, models, performance
Explanation
Model Pruning is a technique for reducing the size and computational cost of neural networks by identifying and removing components that contribute little to the model's performance. Inspired by synaptic pruning in the developing brain, it produces sparser, more efficient models.
**Core Idea**:
Neural networks are often over-parameterized - they contain far more parameters than necessary for the task. Pruning exploits this redundancy by removing the least important connections, resulting in a smaller model that performs nearly as well as the original.
**Types of Pruning**:
1. **Unstructured Pruning**: Removes individual weights (sets them to zero). Creates sparse weight matrices. Achieves high compression ratios but requires specialized hardware/software to realize speed gains.
2. **Structured Pruning**: Removes entire neurons, filters, channels, or attention heads. Produces models that run faster on standard hardware because entire computation units are eliminated.
3. **Layer Pruning**: Removes entire layers from deep networks. Most aggressive form but risks significant accuracy loss.
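The difference between unstructured and structured pruning can be sketched on a toy weight matrix with NumPy (illustrative only; real pruning operates on trained network layers):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))  # weights of a toy layer: 4 neurons, 6 inputs

# Unstructured pruning: zero out the 50% of weights with smallest |value|.
# The matrix keeps its shape; it just becomes sparse.
threshold = np.percentile(np.abs(W), 50)
W_unstructured = np.where(np.abs(W) >= threshold, W, 0.0)

# Structured pruning: drop whole neurons (rows) ranked by L2 norm.
# The matrix physically shrinks, so a dense matmul gets cheaper.
norms = np.linalg.norm(W, axis=1)
keep = np.argsort(norms)[-2:]      # keep the 2 strongest neurons
W_structured = W[np.sort(keep)]

print(W_unstructured.shape)  # (4, 6): same shape, ~50% zeros
print(W_structured.shape)    # (2, 6): physically smaller
```

This is why structured pruning speeds up standard hardware directly, while unstructured pruning only helps if the runtime can exploit sparse matrices.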
**Pruning Approaches**:
- **Magnitude Pruning**: Remove weights with the smallest absolute values (the simplest and most common approach)
- **Gradient-based Pruning**: Remove weights that have the least impact on the loss function
- **Sensitivity Analysis**: Prune layers or components based on their measured contribution to accuracy
- **Lottery Ticket Hypothesis**: The idea that dense networks contain sparse subnetworks (winning tickets) that can match the full network's performance when trained in isolation
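The magnitude criterion above can be sketched as one function (a minimal illustration; the global-threshold choice is an assumption, and per-layer thresholds are also common):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of entries with smallest |value|.

    `weights` is a list of arrays (one per layer); the threshold is
    computed globally, so layers with many small weights absorb more
    of the pruning.
    """
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights])
    threshold = np.quantile(all_mags, sparsity)
    return [np.where(np.abs(w) >= threshold, w, 0.0) for w in weights]

rng = np.random.default_rng(1)
layers = [rng.normal(size=(8, 8)), rng.normal(size=(8, 4))]
pruned = magnitude_prune(layers, sparsity=0.8)
total = sum(w.size for w in pruned)
zeros = sum(int((w == 0).sum()) for w in pruned)
print(f"sparsity: {zeros / total:.2f}")
```

Frameworks ship this criterion out of the box, e.g. PyTorch's `torch.nn.utils.prune.l1_unstructured`.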
**The Pruning Pipeline**:
1. Train the full model to convergence
2. Evaluate component importance (weights, neurons, heads)
3. Remove low-importance components based on a threshold or target sparsity
4. Fine-tune the pruned model to recover accuracy
5. Optionally iterate: prune more, fine-tune again
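The five steps above can be demonstrated end-to-end on a toy problem (linear regression standing in for "the full model"; real pipelines swap in a neural network and its training loop):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy task: recover a sparse ground-truth linear map.
X = rng.normal(size=(200, 16))
w_true = np.zeros(16)
w_true[:4] = [2.0, -1.5, 1.0, 0.5]
y = X @ w_true

def train(w, mask, steps=300, lr=0.05):
    # Gradient descent; the mask keeps pruned weights frozen at zero.
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(X)
        w = (w - lr * grad) * mask
    return w

w = train(np.zeros(16), np.ones(16))               # 1. train to convergence
mask = np.ones(16)
for sparsity in (0.5, 0.75):                       # 5. iterate: prune more
    threshold = np.quantile(np.abs(w), sparsity)   # 2. importance = |weight|
    mask = (np.abs(w) >= threshold).astype(float)  # 3. remove low-importance
    w = train(w * mask, mask)                      # 4. fine-tune to recover

mse = float(np.mean((X @ w - y) ** 2))
print(f"kept {int(mask.sum())}/16 weights, MSE {mse:.4f}")
```

Because only 4 of the 16 ground-truth weights are nonzero, iterative pruning recovers the sparse solution with essentially no loss; gradual prune-finetune cycles like this usually preserve accuracy better than pruning to the target sparsity in one shot.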
**Pruning in Practice**:
- In many over-parameterized networks, 80-90% of weights can be removed with only minor accuracy loss, though the achievable ratio is model- and task-dependent
- Structured pruning typically yields 2-4x inference speedups on standard hardware
- Pruning composes well with quantization, so their compression ratios multiply
- Increasingly applied to LLMs to reduce serving costs
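The compounding effect of combining pruning with quantization is just multiplication of the two ratios. A back-of-the-envelope sketch (the model size and ratios are illustrative assumptions, not benchmarks):

```python
# Hypothetical 7B-parameter model: 50% structured pruning, then
# float32 -> int8 quantization of the surviving weights.
params = 7_000_000_000
bytes_fp32 = params * 4           # float32 = 4 bytes per weight

kept = int(params * (1 - 0.5))    # after 50% structured pruning
bytes_int8 = kept * 1             # int8 = 1 byte per weight

ratio = bytes_fp32 / bytes_int8
print(f"{ratio:.0f}x smaller")    # 2x from pruning * 4x from quantization
```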
**Limitations**:
- Aggressive pruning degrades accuracy
- Unstructured sparsity needs specialized hardware for real speedups
- Finding optimal pruning strategy often requires experimentation
- Pruned models may lose robustness or performance on edge cases
Related Concepts
← Back to all concepts