AI Mixture of Experts
Architecture where multiple specialized sub-networks are selectively activated for different inputs to improve efficiency.
Also known as: MoE, Mixture of Experts, Sparse Mixture of Experts
Category: AI
Tags: ai, machine-learning, architectures, performance
Explanation
Mixture of Experts (MoE) is a neural network architecture where a model contains multiple specialized sub-networks called "experts" and a routing mechanism (often called a gating network) that activates only a subset of them for each input. This is **sparse activation**: the model has a very large total parameter count but only uses a fraction of them per inference step.
**How It Works**
In a typical MoE transformer layer, each token is routed to a small number of experts (often 2 out of 8 or more). The gating network learns which experts are most relevant for each input, and only those experts process the token. The outputs are then combined, weighted by the router's confidence scores.
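The routing step described above can be sketched in a few lines. This is a minimal, illustrative NumPy sketch (not any production implementation): a linear gating network scores all experts, the top two are selected, their scores are renormalized with softmax, and only those experts process the token. All sizes and names here (`d_model`, `n_experts`, `w_gate`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical sizes for illustration.
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" here is just a small weight matrix standing in for a
# feed-forward sub-network.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
# Gating network: a single linear layer producing one logit per expert.
w_gate = rng.standard_normal((d_model, n_experts)) * 0.1

def moe_layer(token):
    logits = token @ w_gate                # router logits, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]      # indices of the top-k experts
    weights = softmax(logits[top])         # renormalize over selected experts only
    # Only the chosen experts run; outputs are combined by router weight.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_layer(token)
print(out.shape)  # → (16,)
```

Note that the other six experts are never evaluated for this token, which is exactly where the compute savings come from.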
This means an MoE model with 100 billion total parameters that activates 20 billion per token can match or beat a 70-billion-parameter dense model in quality while using substantially less compute per token at inference (though not less memory; see the trade-offs below).
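The arithmetic behind that comparison can be made explicit. The figures below are the illustrative numbers from the paragraph above, not benchmarks; real costs also depend on hardware, batching, and memory bandwidth.

```python
# Per-token compute scales roughly with ACTIVE parameters;
# weight memory scales with TOTAL parameters.
moe_total_params  = 100e9   # parameters that must be held in memory
moe_active_params = 20e9    # parameters actually used per token
dense_params      = 70e9    # dense model: total == active

compute_ratio = moe_active_params / dense_params
memory_ratio  = moe_total_params / dense_params

print(f"compute per token: {compute_ratio:.2f}x the dense model")  # → 0.29x
print(f"weight memory:     {memory_ratio:.2f}x the dense model")   # → 1.43x
```

The same comparison also previews the central trade-off: the MoE model is cheaper per token in compute but larger in memory.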
**Benefits**
- **Better performance per compute dollar**: More total knowledge encoded in the parameters without proportionally higher inference cost.
- **Lower per-token compute**: Only a fraction of the parameters participate in each forward pass, so inference requires less compute than a dense model of the same total size.
- **Scalability**: MoE models can be scaled to very large parameter counts while keeping inference costs manageable.
Prominent examples include Mixtral, DeepSeek, and Grok, all of which use MoE architectures; GPT-4 is also widely reported to be an MoE model, though OpenAI has not confirmed its design.
**Trade-Offs**
- **Memory**: All experts must be loaded into memory even though only some are active for each token. This makes the total memory footprint much larger than a dense model with the same active parameter count.
- **Quantization challenges**: MoE models are harder to quantize effectively because different experts may have different weight distributions.
- **Load balancing**: During training, poorly balanced routing can lead to some experts being overtrained while others are undertrained. Auxiliary loss functions and load-balancing techniques are used to mitigate this.
- **Routing complexity**: The gating mechanism adds engineering complexity and can introduce instability during training.
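One widely used load-balancing technique, of the kind mentioned above, is the auxiliary loss introduced with the Switch Transformer: it penalizes the product of each expert's token fraction and its mean router probability, and is minimized when routing is uniform across experts. The sketch below is a simplified NumPy version under that assumption, not a reference implementation.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignments, n_experts):
    """Switch-Transformer-style auxiliary loss (one common formulation).

    router_probs:       (n_tokens, n_experts) softmax outputs of the gate
    expert_assignments: (n_tokens,) index of the expert each token was sent to
    """
    # f_i: fraction of tokens routed to expert i
    f = np.bincount(expert_assignments, minlength=n_experts) / len(expert_assignments)
    # p_i: mean router probability assigned to expert i
    p = router_probs.mean(axis=0)
    # Scaled so the loss equals 1.0 for perfectly uniform routing.
    return n_experts * np.sum(f * p)

rng = np.random.default_rng(0)
n_tokens, n_experts = 1000, 8
logits = rng.standard_normal((n_tokens, n_experts))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assignments = probs.argmax(axis=1)   # top-1 routing for simplicity
print(round(load_balancing_loss(probs, assignments, n_experts), 3))
```

During training this term is added to the main loss with a small coefficient, nudging the router toward spreading tokens evenly without dictating which expert handles which token.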