Model Quantization
A technique for reducing the numerical precision of a neural network's weights and activations to decrease model size, memory usage, and inference latency.
Also known as: Quantization, Weight Quantization, Neural Network Quantization
Category: AI
Tags: ai, machine-learning, optimization, performance, models
Explanation
Model Quantization is a model compression technique that converts the high-precision floating-point numbers (typically FP32 or FP16) used in neural network weights and activations into lower-precision formats (INT8, INT4, or even binary). This trades a small amount of accuracy for significant gains in speed, memory, and power efficiency.
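The core mechanic can be sketched with affine (asymmetric) quantization, which maps a tensor's float range onto 256 integer levels via a scale and zero point. This is a minimal NumPy illustration, not any particular library's implementation:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Affine (asymmetric) quantization of a float tensor to uint8."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0            # float range spread over 256 levels
    zero_point = int(round(-x_min / scale))    # integer code that represents 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_err = float(np.abs(weights - restored).max())  # rounding error, on the order of scale/2
```

Each float is stored as one byte plus two shared parameters (`scale`, `zero_point`), which is where the ~4x size reduction over FP32 comes from.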
**Why Quantize?**:
- **Smaller models**: A model quantized from FP32 to INT8 is roughly 4x smaller
- **Faster inference**: Lower-precision arithmetic is faster on most hardware
- **Lower power consumption**: Critical for mobile and edge devices
- **Lower cost**: Smaller models require less expensive hardware to serve
- **Enables deployment**: Makes large models feasible on resource-constrained devices
**Types of Quantization**:
1. **Post-Training Quantization (PTQ)**: Applied after training is complete. No retraining needed. Fastest to implement but may lose more accuracy.
- Dynamic quantization: Weights quantized statically, activations quantized at runtime
- Static quantization: Both weights and activations quantized using calibration data
2. **Quantization-Aware Training (QAT)**: Simulates quantization during training so the model learns to compensate. Produces better accuracy but requires retraining.
3. **Mixed-Precision Quantization**: Uses different precision levels for different layers based on their sensitivity. Balances accuracy and efficiency.
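The calibration step in static PTQ can be sketched as follows: run representative data through the model once, record the observed activation range, and derive fixed quantization parameters from it. The `MinMaxObserver` class and the fake ReLU layer here are illustrative assumptions, not a real framework API:

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running min/max of activations across calibration batches."""
    def __init__(self):
        self.lo, self.hi = float("inf"), float("-inf")

    def observe(self, x: np.ndarray):
        self.lo = min(self.lo, float(x.min()))
        self.hi = max(self.hi, float(x.max()))

    def qparams(self):
        """Derive uint8 scale and zero point from the observed range."""
        scale = (self.hi - self.lo) / 255.0
        zero_point = int(round(-self.lo / scale))
        return scale, zero_point

# Static PTQ: feed calibration data through the layer to observe activation ranges.
rng = np.random.default_rng(0)
layer_w = rng.standard_normal((64, 64)).astype(np.float32)
obs = MinMaxObserver()
for _ in range(10):                                  # calibration batches
    batch = rng.standard_normal((32, 64)).astype(np.float32)
    obs.observe(np.maximum(batch @ layer_w, 0))      # stand-in ReLU layer
scale, zp = obs.qparams()   # fixed qparams, baked in before deployment
```

Dynamic quantization skips this step and computes activation ranges per batch at runtime, which is simpler but adds inference overhead.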
**Common Precision Levels**:
- **FP32** (Full precision): 32-bit floating-point, standard training precision
- **FP16/BF16** (Half precision): 16-bit; BF16 keeps FP32's exponent range at the cost of mantissa precision. Common for GPU inference

- **INT8**: 8-bit integer, 4x compression, widely supported
- **INT4/NF4**: 4-bit, 8x compression, popular for LLM inference (GPTQ, GGUF)
- **Binary/Ternary**: Extreme compression, significant accuracy loss
**Quantization for LLMs**:
With the rise of large language models, quantization has become essential for practical deployment. Techniques like GPTQ, AWQ, and GGML/GGUF enable running models with billions of parameters on consumer hardware. A 70B parameter model in FP16 requires ~140GB of memory, but in 4-bit quantization it fits in ~35GB.
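The memory figures above follow directly from parameters × bits per parameter. A quick back-of-the-envelope helper (weights only, ignoring activation and KV-cache memory):

```python
def model_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate weight memory: parameters x bits, converted to gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

params = 70e9                         # 70B-parameter model
fp16 = model_memory_gb(params, 16)    # 140.0 GB
int4 = model_memory_gb(params, 4)     # 35.0 GB
```

Real 4-bit formats such as GPTQ or GGUF store small amounts of extra metadata (scales, zero points per group), so actual files run slightly larger than this estimate.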
**Trade-offs**:
- Lower precision means some loss of accuracy (usually small for INT8, more noticeable for INT4)
- Not all operations quantize equally well; attention layers are often more sensitive
- Hardware support varies; not all chips accelerate all precision formats
- Calibration data quality affects post-training quantization results
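The accuracy trade-off across bit widths can be made concrete by measuring round-trip error on the same weights at different precisions. A minimal sketch using symmetric quantization (illustrative, not a specific library's scheme):

```python
import numpy as np

def quant_error(x: np.ndarray, bits: int) -> float:
    """Mean absolute round-trip error for symmetric b-bit quantization."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for INT8, 7 for INT4
    scale = float(np.abs(x).max()) / levels
    q = np.clip(np.round(x / scale), -levels, levels)
    return float(np.abs(x - q * scale).mean())

rng = np.random.default_rng(0)
w = rng.standard_normal(10_000).astype(np.float32)
err8 = quant_error(w, 8)   # fine grid, small error
err4 = quant_error(w, 4)   # ~16x coarser grid, noticeably larger error
```

Running a sensitivity check like this per layer is the basis of mixed-precision schemes: layers where low-bit error is large stay at higher precision.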