Reward Model
A neural network trained to predict human preferences, used to provide a scalar reward signal for optimizing language model behavior in RLHF.
Also known as: RM, Preference Model
Category: AI
Tags: ai, machine-learning, alignment, training, models
Explanation
A reward model is a neural network that learns to predict how humans would evaluate AI-generated outputs, producing a scalar score that serves as a proxy for human judgment. It is a critical component in Reinforcement Learning from Human Feedback (RLHF), where it replaces the need for continuous human evaluation during the policy optimization phase by providing an automated, scalable reward signal.
Reward models are trained on comparison data collected from human evaluators. Annotators are presented with pairs of model outputs for the same prompt and asked to indicate which response they prefer. These preference comparisons are used to train the reward model via the Bradley-Terry model or similar ranking frameworks. The trained reward model can then score any model output, enabling the RL optimization loop to run without requiring a human in the loop for every generated response.
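Under the Bradley-Terry model, the probability that the chosen response beats the rejected one is the sigmoid of the difference between their reward scores, and training minimizes the negative log of that probability. A minimal sketch of that pairwise loss (the function name and signature are illustrative, not from any particular library):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for a single comparison.

    P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected),
    and the loss is -log of that probability, so the loss shrinks as the
    reward model assigns a larger margin to the human-preferred response.
    """
    margin = r_chosen - r_rejected
    prob_chosen = 1.0 / (1.0 + math.exp(-margin))
    return -math.log(prob_chosen)
```

When the two scores are equal the loss is ln 2 (the model is indifferent), and it decreases monotonically as the margin in favor of the chosen response grows; in practice this loss is averaged over a batch of comparisons and backpropagated through the shared network that produced both scores.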
The architecture of a reward model is typically based on the same transformer foundation as the language model it evaluates. A common approach takes a pretrained language model, removes the language modeling head, and adds a scalar output head that produces a single reward score. The model processes the prompt and response together and outputs a score reflecting the predicted quality of the response.
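The scalar head described above is just a linear projection from the backbone's final hidden state to a single number. A toy sketch, assuming the transformer backbone has already produced a hidden vector for the last token of the prompt-plus-response sequence (the class and method names here are hypothetical):

```python
class ScalarRewardHead:
    """Minimal stand-in for the scalar output head of a reward model.

    In a real system the backbone is a pretrained transformer and this
    head is a learned linear layer; here the backbone is omitted and the
    head is a plain dot product so the idea stays self-contained.
    """

    def __init__(self, weights: list, bias: float = 0.0):
        self.weights = weights  # one weight per hidden dimension
        self.bias = bias

    def score(self, last_hidden_state: list) -> float:
        # Project the final-token hidden vector down to one scalar reward.
        return sum(w * h for w, h in zip(self.weights, last_hidden_state)) + self.bias
```

The key design point is that the head outputs a single unbounded scalar rather than a distribution over vocabulary tokens, which is why the original language modeling head is removed.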
Reward model quality is perhaps the single most important factor in RLHF success. A reward model that accurately captures human preferences enables the language model to improve meaningfully. A flawed reward model, however, can lead to reward hacking, where the language model learns to produce outputs that score highly on the reward model without actually being good responses. Common failure modes include rewarding verbosity over substance, favoring confident-sounding but incorrect answers, and susceptibility to specific stylistic patterns that game the score.
To improve reward model robustness, researchers use techniques like ensembling multiple reward models, training on diverse annotator pools, iteratively updating the reward model as the policy improves (to prevent distribution shift), and using process-based reward models that evaluate each step of reasoning rather than just the final output.
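Ensembling, the first technique above, can be made conservative by penalizing disagreement among the ensemble members: when the members diverge, the combined score is pulled down, discouraging the policy from exploiting any single model's blind spots. A sketch under that assumption (the function and its `pessimism` parameter are illustrative, not a standard API):

```python
def ensemble_reward(member_scores: list, pessimism: float = 1.0) -> float:
    """Conservative ensemble score: mean minus a disagreement penalty.

    member_scores holds one scalar reward per ensemble member for the
    same (prompt, response) pair. The penalty is the standard deviation
    across members scaled by `pessimism`, so high disagreement lowers
    the reward the policy actually receives.
    """
    n = len(member_scores)
    mean = sum(member_scores) / n
    variance = sum((s - mean) ** 2 for s in member_scores) / n
    return mean - pessimism * variance ** 0.5
```

With `pessimism=0` this reduces to plain averaging; larger values trade away some reward signal for robustness against reward hacking on out-of-distribution outputs.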
The reward model concept extends beyond RLHF. Constitutional AI uses reward models that evaluate outputs against a set of principles. Reward models are also used in best-of-n sampling (generating multiple outputs and selecting the highest-scoring one), automated evaluation of AI systems, and as classifiers for content safety filtering. The development of more accurate and robust reward models remains a key research direction in AI alignment.
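Best-of-n sampling, mentioned above, needs no gradient updates at all: generate n candidates and keep the one the reward model scores highest. A minimal sketch, where `generate` and `reward_model` are hypothetical callables standing in for a sampling-enabled language model and a trained reward model:

```python
def best_of_n(prompt: str, generate, reward_model, n: int = 4) -> str:
    """Select the highest-scoring of n sampled responses.

    generate(prompt)            -> one sampled candidate response
    reward_model(prompt, resp)  -> scalar score for that response
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda resp: reward_model(prompt, resp))
```

Because it only reranks samples, best-of-n is often used as a simple baseline against full RLHF training; its cost grows linearly with n at inference time instead of requiring any policy optimization.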