Direct Preference Optimization (DPO)

A simplified alternative to RLHF that fine-tunes language models directly on human preference data, without training a separate reward model.

Related Concepts:
Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning
Reward Model
Fine-Tuning
AI Alignment
Constitutional AI
Instruction Tuning
Large Language Models (LLMs)
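
To make the "no separate reward model" point concrete, here is a minimal sketch of the DPO objective in PyTorch. It assumes the summed per-sequence log-probabilities of each chosen and rejected response have already been computed under the policy and a frozen reference model; the function and variable names are illustrative, not taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss over a batch of preference pairs (sketch).

    The policy implicitly defines a reward
        r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)),
    so preferences are fit directly, with no separately trained reward model.
    """
    # Implicit reward of each response, relative to the frozen reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood: -log sigmoid(reward margin)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Usage with a dummy batch of 4 preference pairs (log-probs are illustrative)
policy_chosen = torch.tensor([-12.3, -9.8, -15.1, -11.0])
policy_rejected = torch.tensor([-13.0, -10.5, -14.9, -12.2])
ref_chosen = torch.tensor([-12.5, -10.0, -15.0, -11.4])
ref_rejected = torch.tensor([-12.8, -10.2, -15.2, -12.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The beta hyperparameter controls how far the policy may drift from the reference model: larger values penalize deviation more strongly, playing the role the KL constraint plays in RLHF.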