safety - Concepts
Explore concepts tagged with "safety"
Total concepts: 24
Concepts
- Constitutional AI - An AI training method that uses a set of principles (a constitution) to guide model behavior and self-improvement.
- AI Agent Permissions - Controls governing what actions, tools, files, and resources AI agents can access, enforcing the principle of least privilege in agentic AI systems.
- AI Alignment - Ensuring AI systems behave in accordance with human intentions and values.
- Human-out-of-the-Loop - A fully autonomous operating model in which AI systems act independently, with no human oversight or intervention in real-time decision-making.
- Human-in-the-Loop - Systems design where humans remain actively involved in AI decision-making processes.
- Plan Continuation Bias - The tendency to continue with an original course of action even when changing circumstances suggest the plan should be revised or abandoned.
- Human-on-the-Loop - A supervisory model where humans monitor AI systems and can intervene when needed, but are not required to approve every individual action or decision.
- Automation Complacency - Reduced vigilance and monitoring when relying on automated systems, leading to failure to detect errors or malfunctions.
- Red Teaming - An adversarial testing practice where a dedicated team attempts to find vulnerabilities, flaws, or failure modes in a system by simulating attacks or misuse scenarios.
- AI Guardrails - Safety constraints and boundaries built into AI systems to prevent harmful or undesired outputs.
- Microsleep - Brief, involuntary episodes of sleep lasting a few seconds that occur when a person is fatigued but trying to stay awake.
- Guardrails - Safety constraints and boundaries that control AI system behavior, preventing harmful, undesired, or out-of-scope outputs and actions.
- Barrier Analysis - A root cause analysis technique that examines what barriers should have prevented an incident and why they failed.
- Bow-Tie Analysis - A risk analysis method that visually maps the pathways from causes through a hazardous event to consequences, showing preventive and mitigative barriers.
- Swiss Cheese Model - A model illustrating how accidents occur when holes in multiple layers of defense align, allowing a hazard to pass through all barriers.
- AI Red Teaming - Systematic adversarial testing of AI systems to discover vulnerabilities, biases, and failure modes before deployment.
- AI Safety - Research and practices ensuring AI systems are beneficial and don't cause unintended harm.
- AI Trust - The confidence users and stakeholders place in AI systems to perform reliably, safely, and in alignment with their expectations and values.
- Resilience Engineering - A discipline focused on understanding how systems succeed under varying conditions and building capacity to adapt to unexpected situations.
- Reward Hacking - A failure mode in reinforcement learning where an agent exploits flaws in the reward function to achieve high reward without fulfilling the intended objective.
- Jailbreaking AI - Techniques used to bypass an AI model's safety guardrails and restrictions to produce outputs it was designed to refuse.
- AI Data Security - Protecting sensitive data when using AI systems, where every interaction, including prompts, uploaded files, tool-call results, and agent memory, is a potential data exposure point.
- Responsible AI - A comprehensive framework for developing and deploying AI systems that are ethical, transparent, fair, accountable, safe, and beneficial to society.
- AI Oversight - The governance mechanisms, processes, and institutions designed to monitor, evaluate, and regulate AI systems throughout their lifecycle.