alignment - Concepts
Explore concepts tagged with "alignment"
Total concepts: 9
Concepts
- Instruction Tuning - A fine-tuning technique that trains language models to follow natural language instructions by learning from examples of instruction-response pairs (see the first sketch after this list).
- Reinforcement Learning from Human Feedback (RLHF) - A training technique that aligns LLM outputs with human preferences by training a reward model on human feedback and using it to guide the model during reinforcement learning.
- Reward Model - A neural network trained to predict human preferences, used to provide a scalar reward signal for optimizing language model behavior in RLHF.
- Constitutional AI - An AI training method that uses a set of principles (a constitution) to guide model behavior and self-improvement.
- Reward Hacking - A failure mode in reinforcement learning where an agent exploits flaws in the reward function to achieve high reward without fulfilling the intended objective.
- Direct Preference Optimization - A simplified alternative to RLHF that fine-tunes language models directly on human preference data without training a separate reward model (see the second sketch after this list).
- Team Charter - A document defining a team's purpose, goals, roles, and operating principles.
- Shared Vision - A common understanding of the future that a team wants to create together, serving as a powerful tool for alignment and motivation.
- Shared Understanding - Common knowledge, perspectives, and mental models that enable effective team collaboration.
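To make "examples of instruction-response pairs" concrete, here is a minimal Python sketch. The pairs and the prompt template are hypothetical stand-ins for a real supervised fine-tuning dataset, not a prescribed format.

```python
# Hypothetical instruction-tuning data; real datasets and templates vary.
instruction_response_pairs = [
    {
        "instruction": "Summarize the following sentence in five words or fewer.",
        "input": "The quarterly report shows revenue grew faster than expected.",
        "response": "Revenue grew faster than expected.",
    },
    {
        "instruction": "Translate to French: 'Good morning, everyone.'",
        "input": "",
        "response": "Bonjour à tous.",
    },
]

def format_example(example: dict) -> str:
    """Render one pair into a single training string (template is illustrative)."""
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example["input"]:
        prompt += f"### Input:\n{example['input']}\n"
    prompt += "### Response:\n"
    # During fine-tuning, the loss is typically computed only on the response tokens.
    return prompt + example["response"]

for ex in instruction_response_pairs:
    print(format_example(ex))
    print("---")
```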
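The Reward Model, RLHF, and Direct Preference Optimization entries all rest on the same pairwise preference data. The sketch below is a minimal illustration, assuming PyTorch and made-up numbers in place of real model outputs, of the two objectives involved: the Bradley-Terry pairwise loss used to fit an RLHF reward model, and the DPO loss that optimizes the policy on preferences directly against a frozen reference model.

```python
import torch
import torch.nn.functional as F

# Hypothetical scores for two preference pairs; real values would come from
# model forward passes over (prompt, chosen, rejected) triples.
r_chosen = torch.tensor([1.3, 0.2])      # reward-model scores for preferred responses
r_rejected = torch.tensor([0.4, -0.1])   # reward-model scores for dispreferred responses

# RLHF reward-model training: a Bradley-Terry pairwise loss that pushes the
# chosen response's reward above the rejected one's.
rm_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

# DPO: no separate reward model. Use summed log-probabilities of each response
# under the policy being tuned and under a frozen reference model.
logp_chosen_policy = torch.tensor([-12.0, -15.5])
logp_rejected_policy = torch.tensor([-14.0, -15.0])
logp_chosen_ref = torch.tensor([-13.0, -15.0])
logp_rejected_ref = torch.tensor([-13.5, -14.8])

beta = 0.1  # temperature controlling how far the policy may drift from the reference
margin = (logp_chosen_policy - logp_chosen_ref) - (logp_rejected_policy - logp_rejected_ref)
dpo_loss = -F.logsigmoid(beta * margin).mean()

print(f"reward-model loss: {rm_loss.item():.3f}, DPO loss: {dpo_loss.item():.3f}")
```

Both losses favor the chosen response; the difference is whether the preference signal passes through a separately trained reward model or through log-probability ratios against the reference model.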