Gating Network
A neural network component that learns to route inputs to the most appropriate expert sub-networks in mixture of experts architectures.
Also known as: Router Network, Expert Router, Routing Network
Category: AI
Tags: ai, machine-learning, deep-learning, architecture, models
Explanation
A gating network, also known as a router, is a learned component in mixture of experts (MoE) architectures that determines which expert sub-networks should process a given input. It acts as a traffic controller, examining each input and producing a probability distribution or selection over the available experts. This routing mechanism is what enables MoE models to achieve conditional computation, activating only relevant experts rather than the entire model for each input.
In its simplest form, a gating network is a linear layer followed by a softmax function. Given an input token representation, it produces a probability distribution across all experts. In top-k routing, the k experts with the highest gating scores are selected, and their outputs are weighted by the corresponding gate values. Most modern implementations use top-1 or top-2 routing, meaning each token is processed by just one or two experts out of potentially hundreds.
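The linear-layer-plus-softmax router described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the dimensions, random seed, and weight scale are arbitrary choices for the example, and a real MoE would apply this per token inside a larger model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8  # illustrative sizes, not from any particular model

# The gating network: a single linear layer (W_g) followed by softmax.
W_g = rng.normal(scale=0.02, size=(d_model, n_experts))

def top_k_gate(x, k=2):
    """Return indices of the top-k experts and their renormalized weights."""
    logits = x @ W_g                          # one gating score per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over all experts
    top = np.argsort(probs)[::-1][:k]         # k highest-scoring experts
    weights = probs[top] / probs[top].sum()   # renormalize over the selected k
    return top, weights

x = rng.normal(size=d_model)                  # one token representation
experts, weights = top_k_gate(x, k=2)
# The token's output would be sum_i weights[i] * expert_{experts[i]}(x)
```

With top-1 routing (`k=1`) the renormalization is trivial and the selected expert processes the token alone, which is what makes the computation conditional.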
The design of the gating network has profound implications for model behavior. A well-functioning router learns to specialize experts on different types of inputs, creating a natural division of labor. Research has shown that experts in language models often specialize by topic, syntax type, or language, though the specialization patterns are not always interpretable.
Load balancing is the central challenge in gating network design. Without intervention, routers tend to exhibit a rich-get-richer dynamic where a few experts receive most tokens while others are underutilized. This expert collapse wastes model capacity. Solutions include auxiliary load-balancing losses that penalize uneven expert utilization, capacity factors that limit how many tokens each expert can process, and noise injection during training to encourage exploration.
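One common formulation of the auxiliary load-balancing loss (popularized by Switch-Transformer-style MoE models, though other variants exist) multiplies, per expert, the fraction of tokens dispatched to it by the mean router probability it receives. The sketch below assumes top-1 dispatch and uses arbitrary batch sizes for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tokens, n_experts = 64, 8  # illustrative sizes

# Router probabilities for a batch of tokens (each row sums to 1).
logits = rng.normal(size=(n_tokens, n_experts))
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

def load_balancing_loss(probs):
    """Auxiliary loss = n_experts * sum_i f_i * P_i, where f_i is the
    fraction of tokens whose top-1 choice is expert i and P_i is the mean
    router probability for expert i. It is minimized (value 1.0) when
    both distributions are uniform, penalizing rich-get-richer collapse."""
    n_experts = probs.shape[1]
    top1 = probs.argmax(axis=1)
    f = np.bincount(top1, minlength=n_experts) / len(top1)  # dispatch fractions
    P = probs.mean(axis=0)                                  # mean router probs
    return n_experts * float(f @ P)

loss = load_balancing_loss(probs)
```

Adding a small multiple of this loss to the training objective nudges the router toward even utilization without dictating which expert handles which token.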
Variants of gating networks include hash-based routing (which uses fixed hash functions instead of learned routing), expert choice routing (where experts select tokens rather than tokens selecting experts), and soft routing (where all experts contribute but with learned weights). Each approach offers different tradeoffs between routing quality, training stability, and computational efficiency.
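Of the variants above, expert choice routing is perhaps the easiest to see in code: instead of each token picking its top experts, each expert picks its top tokens, which fixes every expert's load by construction. The capacity value and score matrix below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_tokens, n_experts = 12, 4
capacity = 3  # each expert selects this many tokens (hypothetical value)

# Token-to-expert affinity scores, e.g. produced by a linear router.
scores = rng.normal(size=(n_tokens, n_experts))

# Expert choice routing: each expert column selects its top-`capacity`
# tokens, so no auxiliary balancing loss is needed to equalize load.
assignments = {
    e: np.argsort(scores[:, e])[::-1][:capacity].tolist()
    for e in range(n_experts)
}
# Note the tradeoff: a token may be picked by several experts, or by none.
```

The final comment is the key design tension: perfect load balance is traded for the possibility that some tokens receive no expert computation at all.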
The gating network concept extends beyond MoE to any architecture requiring learned conditional computation, including adaptive computation models, early exit networks, and dynamic neural networks that adjust their computation based on input complexity.