Chaos Engineering
The discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions.
Also known as: Resilience Testing, Failure Injection; closely associated with Netflix's Chaos Monkey tool
Category: Software Development
Tags: reliability, distributed-systems, testing, software-engineering, resilience
Explanation
Chaos Engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Pioneered by Netflix with its Chaos Monkey tool, it involves deliberately introducing failures to test a system's resilience.
Key principles:
(1) Build a hypothesis around steady state: define what normal looks like.
(2) Vary real-world events: simulate failures such as server crashes, network issues, or dependency failures.
(3) Run experiments in production: test where it matters most, with safeguards in place.
(4) Automate experiments: continuous chaos testing catches regressions.
(5) Minimize blast radius: start small and expand.
Benefits include discovering weaknesses before they cause outages, improving incident response, and building team confidence. Common tools include Chaos Monkey, Gremlin, and LitmusChaos. The practice embodies the philosophy that understanding failure modes is better than hoping failures won't happen.
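The principles above can be sketched as a toy experiment loop. This is a minimal illustration under stated assumptions, not how Chaos Monkey, Gremlin, or LitmusChaos work internally: `Replica`, `Service`, `success_rate`, and `run_experiment` are hypothetical names, the "service" is an in-memory simulation, and the steady-state metric is assumed to be the fraction of successful requests.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical stand-in for one instance of a service."""
    name: str
    healthy: bool = True


@dataclass
class Service:
    """Toy service: a request succeeds if any replica it tries is healthy."""
    replicas: list
    retries: int = 1  # extra attempts, each against a different replica

    def handle_request(self, rng):
        attempts = min(self.retries + 1, len(self.replicas))
        # Pick distinct replicas so a retry never hits the same instance.
        candidates = rng.sample(self.replicas, attempts)
        return any(r.healthy for r in candidates)


def success_rate(service, rng, requests=1000):
    """Steady-state metric (assumed): fraction of successful requests."""
    ok = sum(service.handle_request(rng) for _ in range(requests))
    return ok / requests


def run_experiment(service, blast_radius, threshold, seed=0):
    """Run one failure-injection experiment and report the outcome.

    blast_radius: fraction of replicas to take down (at least one).
    threshold:    minimum success rate that counts as steady state.
    """
    rng = random.Random(seed)

    # (1) Steady-state hypothesis: confirm the system is normal first.
    baseline = success_rate(service, rng)
    if baseline < threshold:
        return {"status": "skipped", "baseline": baseline}

    # (2) + (5) Inject a real-world event, limited to the blast radius.
    count = max(1, int(len(service.replicas) * blast_radius))
    victims = rng.sample(service.replicas, count)
    for replica in victims:
        replica.healthy = False

    try:
        observed = success_rate(service, rng)
    finally:
        # Safeguard: always restore the replicas, even if measuring fails.
        for replica in victims:
            replica.healthy = True

    return {
        "status": "completed",
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": observed >= threshold,
    }


if __name__ == "__main__":
    service = Service([Replica(f"replica-{i}") for i in range(10)])
    # Start small: take down 10% of replicas, then widen the radius.
    for radius in (0.1, 0.5, 1.0):
        print(radius, run_experiment(service, radius, threshold=0.99))
```

Because the function is deterministic for a given seed and restores state afterwards, it could be called from a scheduled job, which is one way to read principle (4). A real experiment would replace the simulated replicas with calls to the actual infrastructure and add an abort condition tied to live monitoring.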