Mixture of Experts
A neural network architecture that uses a gating network to route inputs to specialized sub-networks called experts, enabling efficient scaling by activating only a subset of parameters for each input.
Also known as: MoE, Sparse Mixture of Experts, SMoE
Category: AI
Tags: ai, machine-learning, deep-learning, architecture, optimization, models
Explanation
Mixture of Experts (MoE) is a machine learning architecture that divides a complex problem into smaller sub-problems, each handled by a specialized sub-network called an expert. A learned gating network examines each input and decides which experts should process it, enabling the model to activate only a fraction of its total parameters for any given input. This conditional computation makes MoE models significantly more efficient than dense models of equivalent capacity.
The concept was first introduced by Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton in 1991. The original formulation used a competitive learning framework where experts specialized on different regions of the input space, coordinated by a gating network that produced a soft assignment over experts. This early work laid the groundwork for decades of research into modular and conditional computation.
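The original soft formulation can be sketched in a few lines: every expert runs on the input, and the gating network's softmax output weights their contributions. All shapes and weights below are illustrative placeholders (random, untrained), not any published model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 4, 3, 5

# Each expert is a tiny linear map; the gate is a linear layer + softmax.
expert_weights = rng.normal(size=(n_experts, d_in, d_out))
gate_weights = rng.normal(size=(d_in, n_experts))

def soft_moe(x):
    # Gating network produces a soft assignment (probabilities) over experts.
    logits = x @ gate_weights
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()
    # Every expert processes the input; outputs are blended by the gates.
    expert_outputs = np.stack([x @ expert_weights[i] for i in range(n_experts)])
    return gates @ expert_outputs  # shape: (d_out,)

y = soft_moe(rng.normal(size=d_in))
print(y.shape)  # (3,)
```

Note that in this dense variant all experts compute on every input; sparsity, where only a few experts run, came later.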
Modern MoE architectures, particularly in large language models, typically replace the feed-forward layers in transformer blocks with MoE layers. Each MoE layer contains multiple expert feed-forward networks and a router (gating network) that selects a small number of experts (often 1 or 2) for each input token. This sparse activation pattern means that while the model may have hundreds of billions of total parameters, only a fraction are used for any single forward pass, dramatically reducing computational cost.
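A minimal sketch of such a sparse MoE layer is shown below: a router scores the experts for each token, and only the top-k expert feed-forward networks actually execute. The dimensions, the two-layer ReLU experts, and top_k = 2 are assumptions for illustration, not the configuration of any specific model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, top_k = 8, 16, 4, 2

# Each expert is a small feed-forward network: W1 (d_model x d_ff), W2 (d_ff x d_model).
W1 = rng.normal(size=(n_experts, d_model, d_ff)) * 0.1
W2 = rng.normal(size=(n_experts, d_ff, d_model)) * 0.1
router = rng.normal(size=(d_model, n_experts)) * 0.1  # gating / routing weights

def moe_layer(token):
    logits = token @ router
    chosen = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                      # renormalize over chosen experts only
    out = np.zeros(d_model)
    for w, e in zip(weights, chosen):
        hidden = np.maximum(token @ W1[e], 0.0)   # expert FFN with ReLU
        out += w * (hidden @ W2[e])               # weighted combination of expert outputs
    return out

out = moe_layer(rng.normal(size=d_model))
print(out.shape)  # (8,)
```

Only 2 of the 4 expert FFNs run per token here, which is exactly the source of the compute savings: total parameters scale with `n_experts`, but per-token FLOPs scale with `top_k`.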
Google's Switch Transformer (2021) demonstrated that scaling to thousands of experts was feasible, achieving significant pre-training speedups over dense models with the same compute budget. Mixtral 8x7B by Mistral AI showed that open-source MoE models could match or exceed much larger dense models. DeepSeek's MoE models pushed efficiency further with fine-grained expert segmentation and always-active shared experts. These implementations established MoE as one of the most practical approaches to building very large models without proportionally increasing inference cost.

A key challenge in MoE training is load balancing. Without careful design, the gating network may collapse to routing most tokens to a small number of experts, leaving others underutilized. Auxiliary loss functions and capacity constraints are commonly used to encourage balanced expert utilization. Other challenges include communication overhead in distributed training, difficulty in fine-tuning, and the large memory footprint despite sparse computation.
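One common auxiliary loss, in the style of the Switch Transformer, penalizes the dot product of the fraction of tokens dispatched to each expert and the mean router probability for that expert; the sketch below is a simplified illustration of that idea, not a drop-in training component.

```python
import numpy as np

def load_balancing_loss(router_probs):
    # router_probs: (n_tokens, n_experts), each row a softmax over experts.
    n_tokens, n_experts = router_probs.shape
    assignments = router_probs.argmax(axis=1)                      # top-1 dispatch
    f = np.bincount(assignments, minlength=n_experts) / n_tokens   # token fraction per expert
    p = router_probs.mean(axis=0)                                  # mean gate probability
    # Scaled so that perfectly uniform routing yields a loss of 1.0;
    # collapse onto few experts drives the loss above 1.0.
    return n_experts * float(f @ p)

# Perfectly balanced routing over 4 experts:
balanced = np.tile(np.eye(4), (8, 1))
print(load_balancing_loss(balanced))  # 1.0

# Collapsed routing (every token to expert 0) is penalized:
collapsed = np.tile(np.array([[1.0, 0.0, 0.0, 0.0]]), (32, 1))
print(load_balancing_loss(collapsed))  # 4.0
```

During training this term is added to the task loss with a small coefficient, nudging the router toward uniform utilization without overriding the routing decisions themselves.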
MoE represents a fundamental shift in how we think about model scaling. Rather than making every parameter participate in every computation, MoE enables conditional computation where model capacity can grow while keeping inference cost manageable. This principle is central to building the next generation of efficient, high-capability AI systems.