Resilience Engineering
A discipline focused on understanding how systems succeed under varying conditions and building capacity to adapt to unexpected situations.
Also known as: Safety-II, Adaptive safety management
Category: Software Development
Tags: resilience, systems-thinking, safety, engineering
Explanation
Resilience engineering is a paradigm for safety management that focuses on how to help systems cope with complexity and variability, rather than simply trying to prevent failures. Originating from the work of Erik Hollnagel, David Woods, and others, it shifts the focus from 'what goes wrong' to 'what goes right' and how to sustain that.
**Core principles**:
- **Systems succeed more than they fail**: Most of the time, complex systems work well despite imperfect conditions. Understanding why things go right is as important as understanding failures.
- **Safety is a dynamic property**: It is not a static state but something that must be continuously created through adaptive behavior.
- **Humans as a source of resilience**: People are not just sources of error; they are the primary source of adaptation and recovery in complex systems.
- **Complexity and coupling**: Modern systems are too complex to fully predict, so building adaptive capacity is more effective than trying to eliminate all possible failures.
**The four cornerstones of resilience**:
1. **Anticipating**: Looking ahead to identify potential challenges and opportunities
2. **Monitoring**: Knowing what to look for and recognizing when conditions are changing
3. **Responding**: Knowing what to do and being able to adjust to actual or anticipated disruptions
4. **Learning**: Knowing what has happened and extracting lessons from both success and failure
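In a software setting, the four cornerstones can be read as four responsibilities of a single component. The sketch below is only an illustration under stated assumptions: the `ResilientService` class, its method names, the 80% warning threshold, and the 20% headroom rule are all hypothetical and do not come from the resilience engineering literature.

```python
from dataclasses import dataclass, field


@dataclass
class ResilientService:
    """Toy service mapping the four cornerstones onto methods (hypothetical names)."""
    capacity: int = 100                            # requests it can handle per tick
    history: list = field(default_factory=list)    # observed loads, kept for learning

    def anticipate(self) -> int:
        """Anticipating: naive forecast, the peak of the last three observations."""
        recent = self.history[-3:]
        return max(recent) if recent else 0

    def monitor(self, load: int) -> bool:
        """Monitoring: flag load at or above 80% of capacity (assumed threshold)."""
        return load * 5 >= self.capacity * 4

    def respond(self, load: int) -> str:
        """Responding: adjust behaviour instead of failing outright."""
        if load > self.capacity:
            return "shed-load"
        if self.monitor(load):
            return "degrade"
        return "normal"

    def learn(self, load: int) -> None:
        """Learning: record every outcome, not only failures, then adapt capacity."""
        self.history.append(load)
        forecast = self.anticipate()
        if forecast > self.capacity:
            # Assumed policy: provision 20% headroom above the forecast peak.
            self.capacity = forecast + forecast // 5


def run(loads):
    """Feed a sequence of loads through the service and collect its responses."""
    svc = ResilientService()
    actions = []
    for load in loads:
        actions.append(svc.respond(load))
        svc.learn(load)
    return actions, svc.capacity
```

Calling `run([50, 85, 120, 110])` walks the service through normal operation, a degraded mode, load shedding, and then normal operation again once it has learned from the overload and raised its capacity. The point of the sketch is that learning happens on every tick, including the ones where nothing went wrong.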
**Resilience engineering vs. traditional safety**:
- Traditional safety: Focuses on counting failures, finding root causes, and adding barriers
- Resilience engineering: Focuses on understanding adaptation, building capacity for surprise, and enabling graceful extensibility
**Practical applications**:
- **Aviation**: Crew Resource Management and adaptive cockpit procedures
- **Healthcare**: Designing systems that support clinician judgment rather than just adding protocols
- **Software engineering**: Chaos engineering, game days, and blameless post-mortems
- **Organizations**: Building learning cultures that improve through experience
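For the software engineering case, a chaos experiment can be as small as wrapping a dependency so that it fails on purpose and checking that the caller degrades rather than crashes. The following is a minimal sketch, not a real chaos engineering tool: `with_chaos`, `fetch_recommendations`, `homepage`, and `game_day` are hypothetical names, and the fallback list is invented for illustration.

```python
import random


class InjectedFault(Exception):
    """Raised by the chaos wrapper to simulate a dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that it fails with the given probability (fault injection)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Hypothetical dependency call; stands in for a remote service."""
    return [f"item-{user_id}-{n}" for n in range(3)]


def homepage(user_id, fetch):
    """Degrade gracefully: fall back to a static list when the dependency fails."""
    try:
        return {"recs": fetch(user_id), "degraded": False}
    except InjectedFault:
        return {"recs": ["popular-1", "popular-2"], "degraded": True}


def game_day(trials=1000, failure_rate=0.3, seed=42):
    """Run a small experiment and count how often the page had to degrade."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_chaos(fetch_recommendations, failure_rate, rng)
    results = [homepage(7, flaky) for _ in range(trials)]
    degraded = sum(r["degraded"] for r in results)
    return {"served": len(results), "degraded": degraded}
```

Every request in `game_day` is served, even when the dependency fails; the experiment only measures how often the fallback path was needed. A blameless post-mortem would then ask what made that adaptation possible, not just what broke.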
**Key insight**: You cannot protect a system against every possible failure. Instead, build the capacity to adapt, respond, and recover from whatever happens — including things you never anticipated.