Model Scaling
The study and practice of increasing neural network size, data, or compute to improve model performance, guided by empirical scaling laws.
Also known as: Neural Scaling, Compute Scaling
Category: AI
Tags: ai, machine-learning, deep-learning, optimization, performance, models
Explanation
Model scaling refers to the systematic approach of increasing the size of neural networks, their training data, or the computational resources used during training to achieve better performance. It is one of the most important concepts in modern AI, as empirical evidence has shown that larger models trained on more data with more compute tend to perform predictably better across a wide range of tasks.
The foundation of model scaling is the discovery of scaling laws, which describe mathematical relationships between model size, dataset size, compute budget, and performance. The most influential work in this area came from researchers at OpenAI (Kaplan et al., 2020) and later from Google DeepMind's Chinchilla paper (Hoffmann et al., 2022). These studies revealed that model performance improves as a power law with increases in parameters, data, and compute, and that there are optimal ratios for allocating a fixed compute budget between model size and training data.
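The power-law relationship can be made concrete with the parametric loss form fitted in the Chinchilla paper, L(N, D) = E + A/N^α + B/D^β, where N is parameter count and D is training tokens. A minimal sketch using the published fit constants (treat them as illustrative of the shape, not as a universal law):

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss from the Chinchilla fit (Hoffmann et al., 2022).

    L(N, D) = E + A / N**alpha + B / D**beta
    Constants are the fit reported in the paper; other model families
    and tokenizers will fit different values.
    """
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling both parameters and tokens lowers the predicted loss,
# but with diminishing returns -- the curve is a power law, not linear.
print(chinchilla_loss(1e9, 20e9))   # 1B params, 20B tokens
print(chinchilla_loss(2e9, 40e9))   # 2x both: lower, but not 2x lower
```

Note that the irreducible term E bounds how low the loss can go no matter how far N and D are pushed, which is why the returns diminish.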
Scaling can happen along several dimensions. Parameter scaling increases the number of weights in the model by adding more layers (depth), widening existing layers, or adding more attention heads. Data scaling involves training on larger and more diverse datasets. Compute scaling means using more GPU/TPU hours for training. The Chinchilla scaling laws suggest that many early large language models were undertrained relative to their size, and that a smaller model trained on more data can outperform a larger model trained on less data for the same compute budget.
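The compute-optimal tradeoff above can be sketched with two common approximations: training FLOPs C ≈ 6·N·D for a transformer, and the Chinchilla rule of thumb of roughly 20 training tokens per parameter. Both are rough heuristics, not exact results:

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a fixed FLOP budget between model size and training data.

    Uses the approximations C ~= 6 * N * D (transformer training FLOPs)
    and D ~= 20 * N (Chinchilla rule of thumb). Substituting the second
    into the first gives N = sqrt(C / 120) and D = 20 * N.
    """
    n_params = math.sqrt(compute_flops / 120.0)
    n_tokens = 20.0 * n_params
    return n_params, n_tokens

# A budget of ~5.76e23 FLOPs yields roughly 70B parameters and
# 1.4T tokens -- close to the configuration Chinchilla itself used.
n, d = chinchilla_optimal(5.76e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

Plugging in the compute budget of an early large model instead shows the undertraining point: for the same FLOPs, the heuristic prescribes a smaller model trained on far more tokens than those models actually saw.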
Efficient scaling techniques have emerged to manage the prohibitive costs of simply making everything bigger. Mixture of experts enables parameter scaling without proportional compute increases. Knowledge distillation transfers capabilities from large models to smaller ones. Model pruning and quantization reduce the resource requirements of already-trained large models. These techniques allow practitioners to achieve scaling benefits while managing practical constraints.
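Of these techniques, knowledge distillation is the simplest to sketch: the student is trained to match the teacher's temperature-softened output distribution. The snippet below follows the standard KL-divergence formulation in plain Python; it is an illustrative sketch, not any particular library's API:

```python
import math

def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Softmax with temperature; higher temperature softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits: list[float],
                      student_logits: list[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A student that reproduces the teacher's logits exactly incurs zero loss;
# any mismatch in the predicted distribution is penalized.
teacher = [2.0, 1.0, 0.1]
print(distillation_loss(teacher, teacher))        # → 0.0
print(distillation_loss(teacher, [0.1, 1.0, 2.0]))  # positive
```

In practice this soft-label loss is usually mixed with the ordinary hard-label cross-entropy, with the temperature controlling how much of the teacher's "dark knowledge" about wrong-class probabilities is transferred.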
The implications of scaling laws are profound for the AI industry. They enable researchers to predict model performance before expensive training runs, plan compute investments, and make informed architecture choices. However, scaling also raises concerns about environmental impact, resource concentration among well-funded labs, and the question of whether scaling alone can lead to artificial general intelligence or whether fundamental architectural innovations are also needed.