Reward Hacking

A failure mode in reinforcement learning where an agent exploits flaws in the reward function to achieve high reward without fulfilling the intended objective.

Related concepts: Reinforcement Learning from Human Feedback (RLHF), Reinforcement Learning, Reward Model, AI Alignment, AI Safety, AI Guardrails, Goodhart's Law
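The mismatch between proxy reward and intended objective can be sketched in a toy example. Everything below (the environment, the `proxy_reward` function, both policies) is hypothetical and only illustrates the failure mode: the reward pays for each "clean" action taken rather than for dirt actually removed, so a policy can farm reward on an already-clean room.

```python
def proxy_reward(action: str) -> int:
    """Flawed reward: +1 for every 'clean' action, regardless of its effect."""
    return 1 if action == "clean" else 0

def run_episode(policy, dirt: int, steps: int = 10):
    """Returns (proxy reward earned, true progress on the intended objective)."""
    total_reward, remaining_dirt = 0, dirt
    for _ in range(steps):
        action = policy(remaining_dirt)
        total_reward += proxy_reward(action)
        if action == "clean" and remaining_dirt > 0:
            remaining_dirt -= 1  # cleaning only helps if dirt is actually left
    return total_reward, dirt - remaining_dirt

honest = lambda dirt: "clean" if dirt > 0 else "idle"  # stops when the job is done
hacker = lambda dirt: "clean"                          # farms the proxy forever

print(run_episode(honest, dirt=3))  # → (3, 3)
print(run_episode(hacker, dirt=3))  # → (10, 3)
```

The hacking policy earns more than three times the proxy reward with zero additional true progress, which is exactly the gap between "high reward" and "intended objective" in the definition above, and an instance of Goodhart's Law: once the proxy becomes the optimization target, it stops being a good measure.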