Reinforcement Learning from Human Feedback (RLHF)
A training technique that aligns LLM outputs with human preferences by using human judgments of model responses to guide further training.
Also known as: RLHF, Human Feedback Training, AI Alignment Training
Category: Techniques
Tags: ai, machine-learning, alignment, training, human-feedback
Explanation
Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for making Large Language Models helpful, harmless, and honest. It bridges the gap between raw language modeling capability and useful, aligned AI assistance.
The RLHF process typically involves three stages (each is sketched in code after the list):
1. **Supervised Fine-Tuning (SFT)**
- Human trainers provide example conversations
- The model learns to mimic desired response patterns
- Creates a baseline for helpful behavior
2. **Reward Model Training**
- The model generates multiple responses to prompts
- Human evaluators rank responses by quality
- A reward model learns to predict human preferences
3. **Policy Optimization**
- The LLM is fine-tuned to maximize the reward model's scores
- Uses algorithms like Proximal Policy Optimization (PPO)
- Balances reward maximization with staying close to the base model
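As a rough illustration of stage 1, the sketch below computes a supervised fine-tuning loss: plain next-token cross-entropy on a human-written demonstration, with the prompt tokens masked out so the model is trained to imitate only the response. This is a minimal sketch, not any specific library's training loop; the `sft_loss` helper, the toy tensor shapes, and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, prompt_len):
    """Next-token cross-entropy on one human demonstration (sketch).

    logits:     (seq_len, vocab) scores from a causal LM (assumed given)
    target_ids: (seq_len,) token ids of the prompt + human-written response
    prompt_len: number of prompt tokens; these are masked out so only the
                response portion contributes to the loss.
    """
    pred = logits[:-1]          # position t predicts token t+1
    tgt = target_ids[1:]
    loss = F.cross_entropy(pred, tgt, reduction="none")
    mask = (torch.arange(tgt.numel()) >= prompt_len - 1).float()
    return (loss * mask).sum() / mask.sum()

# Toy example with random stand-in model outputs.
vocab, seq_len, prompt_len = 100, 12, 4
logits = torch.randn(seq_len, vocab)
target_ids = torch.randint(0, vocab, (seq_len,))
print(sft_loss(logits, target_ids, prompt_len))
```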
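For stage 2, reward models are commonly trained with a pairwise (Bradley-Terry style) ranking loss over human preference pairs: the preferred response should score higher than the rejected one. The sketch below shows a minimal version of that idea, assuming scalar scores have already been produced by the reward model; the function name and toy values are hypothetical.

```python
import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen, r_rejected):
    """Pairwise ranking loss for reward model training (sketch).

    r_chosen / r_rejected: scalar scores the reward model assigns to the
    response the human preferred vs. the one they rejected. Minimizing
    -log sigmoid(r_chosen - r_rejected) pushes preferred responses to
    score higher than rejected ones.
    """
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 4 preference pairs (scores would come from the reward model).
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
r_rejected = torch.tensor([0.4, 0.5, 1.1, -1.0])
print(reward_ranking_loss(r_chosen, r_rejected))
```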
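For stage 3, PPO-based RLHF typically optimizes the reward model's score minus a KL penalty that keeps the policy close to the frozen SFT reference model. The sketch below shows that shaped reward with a simple per-sequence KL estimate; the function name, the `beta` value, and the toy numbers are illustrative assumptions rather than any particular implementation.

```python
def shaped_reward(reward_score, logprob_policy, logprob_ref, beta=0.1):
    """KL-penalized reward used in PPO-based RLHF (sketch).

    reward_score:   scalar the reward model assigns to a sampled response
    logprob_policy: log-probability of that response under the current policy
    logprob_ref:    log-probability under the frozen SFT reference model
    beta:           penalty strength; larger values keep the policy closer
                    to the reference model
    """
    kl_estimate = logprob_policy - logprob_ref   # simple per-sequence estimate
    return reward_score - beta * kl_estimate

# Toy numbers: a well-scored response that has drifted from the reference.
print(shaped_reward(2.5, logprob_policy=-12.0, logprob_ref=-15.0))  # 2.2
```

The KL term is what "balances reward maximization with staying close to the base model": it is one guard against the policy exploiting the reward model rather than genuinely improving.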
What RLHF accomplishes:
- Reduces harmful or biased outputs
- Improves helpfulness and relevance
- Makes models follow instructions better
- Aligns outputs with human values and expectations
Important considerations:
- Human feedback can introduce bias
- Reward hacking is possible (the policy learns to exploit flaws in the reward model, raising its score without genuinely improving quality)
- The quality of human evaluators matters significantly
- Different cultures and individuals may have different preferences
RLHF is what transforms a base language model into an AI assistant people actually want to use.