Incident Management
The process of identifying, responding to, and resolving unplanned disruptions to restore normal service as quickly as possible.
Also known as: Incident response, IM, ITSM incident management
Category: Software Development
Tags: operations, reliability, devops, processes
Explanation
Incident management is the structured process for detecting, responding to, and resolving unplanned events or service disruptions. Its goal is to restore service as quickly as possible while limiting the impact on users and the business.
**Incident management lifecycle**:
1. **Detection**: Identifying that an incident has occurred through monitoring, alerts, or user reports
2. **Triage**: Assessing severity, impact, and urgency to prioritize response
3. **Response**: Assembling the right people and resources to address the incident
4. **Resolution**: Implementing fixes to restore normal operations
5. **Recovery**: Verifying service restoration and addressing any residual issues
6. **Post-incident review**: Learning from the incident to prevent recurrence
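The lifecycle above can be sketched as a simple linear state machine. This is a minimal illustration, not a prescribed implementation: the `Stage` and `Incident` names are hypothetical, and real incidents often loop back (for example, from recovery to response when a fix does not hold), which this sketch deliberately leaves out.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """The six lifecycle stages, numbered in the order listed above."""
    DETECTION = 1
    TRIAGE = 2
    RESPONSE = 3
    RESOLUTION = 4
    RECOVERY = 5
    POST_INCIDENT_REVIEW = 6


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after the last stage."""
    if current is Stage.POST_INCIDENT_REVIEW:
        return None
    return Stage(current.value + 1)


class Incident:
    """Hypothetical incident record that tracks its lifecycle stage."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.stage = Stage.DETECTION
        self.history = [Stage.DETECTION]

    def advance(self) -> Stage:
        """Move to the next stage; refuse to advance past the review."""
        following = next_stage(self.stage)
        if following is None:
            raise ValueError("incident has already completed its review")
        self.stage = following
        self.history.append(following)
        return following
```

Keeping a `history` list gives the scribe's timeline a natural home: each transition can be recorded alongside who made it and when.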
**Severity levels** (common model):
- **SEV-1 (Critical)**: Complete service outage affecting all users
- **SEV-2 (Major)**: Significant degradation affecting many users
- **SEV-3 (Minor)**: Partial degradation with workaround available
- **SEV-4 (Low)**: Minor issue with minimal impact
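Severity criteria are easier to apply consistently when they are written down as explicit rules. The sketch below maps three inputs to the four levels above. The inputs, the `MINIMAL_IMPACT` and `MANY_USERS` thresholds, and the rule that a partial degradation without a workaround is escalated to SEV-2 are all assumptions made for illustration; each organization defines its own criteria.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: complete outage
    SEV2 = 2  # Major: significant degradation
    SEV3 = 3  # Minor: partial degradation, workaround available
    SEV4 = 4  # Low: minimal impact


# Assumed thresholds, expressed as a fraction of users affected.
MINIMAL_IMPACT = 0.01
MANY_USERS = 0.25


def classify(full_outage: bool, affected_fraction: float,
             workaround_available: bool) -> Severity:
    """Assign a severity level using the assumed rules described above."""
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if full_outage:
        return Severity.SEV1
    if affected_fraction >= MANY_USERS:
        return Severity.SEV2
    if affected_fraction >= MINIMAL_IMPACT:
        # Assumption: without a workaround, treat the incident as major.
        return Severity.SEV3 if workaround_available else Severity.SEV2
    return Severity.SEV4
```

Because `Severity` is an `IntEnum`, levels compare numerically, so a check such as `classify(...) <= Severity.SEV2` can decide whether to page an on-call responder.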
**Key roles**:
- **Incident Commander**: Coordinates the overall response effort
- **Communications Lead**: Manages stakeholder and customer communication
- **Subject Matter Experts**: Provide technical expertise for diagnosis and resolution
- **Scribe**: Documents actions, decisions, and timeline
**Best practices**:
- Define clear escalation paths and severity criteria
- Practice incident response through regular drills and game days
- Maintain runbooks for common incident types
- Conduct blameless post-incident reviews
- Track incident metrics such as mean time to detect (MTTD), mean time to resolve or recover (MTTR), and incident frequency to identify trends
- Automate detection and routine response actions where possible
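As an illustration of the metrics mentioned above, the following sketch computes MTTD and MTTR from a list of incident records. The `IncidentRecord` fields are hypothetical, and the choice to measure MTTR from detection to resolution is an assumption: some teams measure it from the start of the incident, and some expand the acronym as mean time to repair, recover, or respond.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import List


@dataclass
class IncidentRecord:
    """Hypothetical record holding the three timestamps the metrics need."""
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_detect(records: List[IncidentRecord]) -> timedelta:
    """Average delay between an incident starting and being detected."""
    seconds = [(r.detected_at - r.started_at).total_seconds() for r in records]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(records: List[IncidentRecord]) -> timedelta:
    """Average delay between detection and resolution (assumed definition)."""
    seconds = [(r.resolved_at - r.detected_at).total_seconds() for r in records]
    return timedelta(seconds=mean(seconds))
```

Both functions raise `statistics.StatisticsError` on an empty list, which is usually preferable to silently reporting a zero for a period with no incidents.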
**Relationship to other practices**:
Incident management is closely related to problem management (finding and addressing the root causes behind incidents), change management (reducing the risk that changes cause incidents), and business continuity (maintaining operations during major incidents).