Neural Scaling Laws
Empirical power-law relationships predicting how AI model performance improves as a function of model size, dataset size, and compute budget.
Also known as: Chinchilla Scaling Laws, Kaplan Scaling Laws, AI Scaling Laws, Compute Scaling Laws
Category: AI
Tags: ai, machine-learning, large-language-models, research, scaling
Explanation
Neural scaling laws are empirical relationships that describe how the performance of neural networks (measured by loss on held-out data) improves predictably as model size (parameters), dataset size (tokens), and compute budget (FLOPs) increase. These relationships take the form of power laws, so loss plotted against any of these quantities appears as a straight line on a log-log plot.
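The power-law form can be sketched in a few lines. This is a minimal illustration for model size alone, of the form L(N) = (N_c / N)^α; the constants `ALPHA_N` and `N_C` are illustrative assumptions roughly in the range reported for language models, not values taken from this text:

```python
import math

# Illustrative power-law loss curve in model size N alone:
# L(N) = (N_c / N)**alpha_N. Constants are assumptions, not exact values.
ALPHA_N = 0.076   # scaling exponent (illustrative)
N_C = 8.8e13      # reference parameter count (illustrative)

def loss(n_params: float) -> float:
    """Predicted held-out loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

def log_log_slope(n1: float, n2: float) -> float:
    """Slope of the loss curve between two model sizes on log-log axes."""
    return (math.log(loss(n2)) - math.log(loss(n1))) / (math.log(n2) - math.log(n1))
```

Because log L is linear in log N with slope -α, the curve is exactly a straight line on log-log axes, which is what makes extrapolation to larger scales possible.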
**Key findings:**
**Kaplan et al. (2020) - OpenAI:**
- Loss scales as a power law with model size, dataset size, and compute
- Larger models are more sample-efficient (learn more per data point)
- Optimal allocation: scale model size faster than dataset size
- Performance is predictable across many orders of magnitude
**Hoffmann et al. (2022) - 'Chinchilla' (DeepMind):**
- Revised optimal compute allocation: model size and training data should be scaled equally
- Many existing large models were 'over-parameterized and under-trained'
- The 70B-parameter Chinchilla model trained on more data outperformed the 280B-parameter Gopher
- Rule of thumb: ~20 tokens of training data per parameter
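The rule of thumb above translates directly into a planning calculation. The sketch below combines the ~20 tokens-per-parameter heuristic from the text with the widely used C ≈ 6·N·D approximation for training FLOPs; both are rough planning numbers, not exact prescriptions:

```python
def chinchilla_optimal(n_params: float) -> dict:
    """Compute-optimal training budget per the Chinchilla rule of thumb.

    Uses ~20 training tokens per parameter and the standard
    C ~ 6 * N * D estimate of forward+backward FLOPs per token.
    """
    tokens = 20 * n_params           # ~20 tokens per parameter
    flops = 6 * n_params * tokens    # approximate total training FLOPs
    return {"tokens": tokens, "flops": flops}

# Chinchilla itself: 70B parameters -> ~1.4T tokens,
# matching the training set size described in the text.
budget = chinchilla_optimal(70e9)
```

Run against Gopher's 280B parameters, the same rule calls for ~5.6T tokens, which makes concrete why the paper judged models of that size under-trained on the data budgets of the time.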
**What the laws predict:**
- **Smooth improvement**: For most benchmarks, performance improves smoothly and predictably with scale
- **Diminishing returns**: Each doubling of compute yields a smaller absolute improvement (though still predictable)
- **No ceiling in sight**: Within observed ranges, no plateaus have been found—though the rate of improvement decreases
- **Cross-task generality**: Scaling laws hold across different tasks, languages, and modalities
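The "diminishing returns" point has a precise shape under a power law: each doubling of compute multiplies loss by the same constant factor 2^(-α), so the *relative* gain per doubling is fixed while the *absolute* gain shrinks. A small sketch, with illustrative constants not taken from the text:

```python
# Compute power law L(C) = (C_c / C)**alpha_C. Constants are assumptions
# chosen only to make the diminishing-returns pattern visible.
ALPHA_C = 0.05
C_C = 1e8  # illustrative reference compute scale

def loss_at(compute: float) -> float:
    return (C_C / compute) ** ALPHA_C

# Absolute loss improvement from each successive doubling of compute.
gains = []
c = 1.0
for _ in range(5):
    gains.append(loss_at(c) - loss_at(2 * c))
    c *= 2
```

Each entry of `gains` is smaller than the last, yet every ratio loss_at(2c)/loss_at(c) equals 2^(-α): returns diminish in absolute terms but remain predictable, which is exactly what makes the laws useful for planning.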
**Implications:**
- **Resource planning**: Organizations can predict performance improvements before investing in expensive training runs
- **Architecture decisions**: Scaling laws help choose between larger models vs. more data vs. more compute
- **Competitive dynamics**: They explain why AI development is increasingly concentrated among organizations with massive compute budgets
- **Research direction**: The 'bitter lesson' (Rich Sutton)—general methods that leverage computation tend to win over clever, human-engineered approaches
**Limitations:**
- Laws describe loss reduction, not necessarily task-specific performance or safety
- Emergent abilities may not follow smooth scaling predictions
- Data quality matters as much as quantity, which scaling laws don't fully capture
- Environmental and economic costs of scaling are not addressed by the laws themselves