alignment - Concepts
Explore concepts tagged with "alignment"
Total concepts: 9
Concepts
- Instruction Tuning - A fine-tuning technique that trains language models to follow natural language instructions by learning from examples of instruction-response pairs (see the first sketch after this list).
- Reinforcement Learning from Human Feedback (RLHF) - A training technique that aligns LLM outputs with human preferences by training a reward model on human feedback and using it to guide the model during reinforcement learning.
- Reward Model - A neural network trained to predict human preferences, used to provide a scalar reward signal for optimizing language model behavior in RLHF.
- Constitutional AI - An AI training method that uses a set of principles (a constitution) to guide model behavior and self-improvement.
- Reward Hacking - A failure mode in reinforcement learning where an agent exploits flaws in the reward function to achieve high reward without fulfilling the intended objective.
- Direct Preference Optimization - A simplified alternative to RLHF that fine-tunes language models directly on human preference data without training a separate reward model (see the second sketch after this list).
- Team Charter - A document defining a team's purpose, goals, roles, and operating principles.
- Shared Vision - A common understanding of the future that a team wants to create together, serving as a powerful tool for alignment and motivation.
- Shared Understanding - Common knowledge, perspectives, and mental models that enable effective team collaboration.
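To make "examples of instruction-response pairs" concrete, here is a minimal Python sketch. The pairs and the prompt template are hypothetical stand-ins for a real supervised fine-tuning dataset, not a prescribed format.

```python
# Hypothetical instruction-tuning data; real datasets and templates vary.
instruction_response_pairs = [
    {
        "instruction": "Summarize the following sentence in five words or fewer.",
        "input": "The quarterly report shows revenue grew faster than expected.",
        "response": "Revenue grew faster than expected.",
    },
    {
        "instruction": "Translate to French: 'Good morning, everyone.'",
        "input": "",
        "response": "Bonjour à tous.",
    },
]

def format_example(example: dict) -> str:
    """Render one pair into a single training string (template is illustrative)."""
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example["input"]:
        prompt += f"### Input:\n{example['input']}\n"
    prompt += "### Response:\n"
    # During fine-tuning, the loss is typically computed only on the response tokens.
    return prompt + example["response"]

for ex in instruction_response_pairs:
    print(format_example(ex))
    print("---")
```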
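The Reward Model, RLHF, and Direct Preference Optimization entries all rest on the same pairwise preference data. The sketch below is a minimal illustration, assuming PyTorch and made-up numbers in place of real model outputs, of the two objectives involved: the Bradley-Terry pairwise loss used to fit an RLHF reward model, and the DPO loss that optimizes the policy on preferences directly against a frozen reference model.

```python
import torch
import torch.nn.functional as F

# Hypothetical scores for two preference pairs; real values would come from
# model forward passes over (prompt, chosen, rejected) triples.
r_chosen = torch.tensor([1.3, 0.2])      # reward-model scores for preferred responses
r_rejected = torch.tensor([0.4, -0.1])   # reward-model scores for dispreferred responses

# RLHF reward-model training: a Bradley-Terry pairwise loss that pushes the
# chosen response's reward above the rejected one's.
rm_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# DPO: no separate reward model. Use summed log-probabilities of each response
# under the policy being tuned and under a frozen reference model.
logp_chosen_policy = torch.tensor([-12.0, -15.5])
logp_rejected_policy = torch.tensor([-14.0, -15.0])
logp_chosen_ref = torch.tensor([-13.0, -15.0])
logp_rejected_ref = torch.tensor([-13.5, -14.8])

beta = 0.1  # temperature controlling how far the policy may drift from the reference
margin = (logp_chosen_policy - logp_chosen_ref) - (logp_rejected_policy - logp_rejected_ref)
dpo_loss = -F.logsigmoid(beta * margin).mean()

print(f"reward-model loss: {rm_loss.item():.3f}, DPO loss: {dpo_loss.item():.3f}")
```

Both losses favor the chosen response; the difference is whether the preference signal passes through a separately trained reward model or through log-probability ratios against the reference model.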