Problem Management
The practice of identifying and managing the underlying causes of incidents to prevent recurrence and minimize impact.
Also known as: ITIL problem management, Root cause elimination
Category: Software Development
Tags: operations, processes, reliability, devops
Explanation
Problem management is an IT service management practice focused on reducing the likelihood and impact of incidents by identifying their root causes and managing known errors. While incident management restores service quickly, problem management asks 'why did this happen?' and 'how do we prevent it from happening again?'
**Problem management vs. incident management**:
- **Incident management**: Reactive, focused on restoring service as quickly as possible. Treats symptoms.
- **Problem management**: Both proactive and reactive, focused on finding and eliminating root causes. Prevents recurrence.
- A single problem can be the root cause of multiple incidents
**Problem management activities**:
**Reactive**:
1. **Problem identification**: Detecting recurring incidents, major incidents, or patterns through trend analysis
2. **Problem investigation**: Root cause analysis using techniques like 5 Whys, Kepner-Tregoe, or fault tree analysis
3. **Known error management**: Documenting known errors and workarounds in a Known Error Database (KEDB)
4. **Problem resolution**: Implementing permanent fixes, often through the change management process
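The identification step lends itself to a small illustration. The sketch below assumes incidents are available as simple records with hypothetical `service` and `symptom` fields. The field names, the `find_problem_candidates` helper, and the threshold of three occurrences are all assumptions made for this example, not part of any ITSM tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    # Hypothetical fields; a real ticketing system has its own schema.
    incident_id: str
    service: str
    symptom: str


def find_problem_candidates(incidents, min_occurrences=3):
    """Group incidents by (service, symptom) and keep the groups that recur.

    Returns a dict mapping (service, symptom) to the list of incident IDs
    for every group with at least `min_occurrences` incidents.
    """
    groups = defaultdict(list)
    for incident in incidents:
        groups[(incident.service, incident.symptom)].append(incident.incident_id)
    return {key: ids for key, ids in groups.items() if len(ids) >= min_occurrences}


incidents = [
    Incident("INC-1", "checkout", "timeout"),
    Incident("INC-2", "checkout", "timeout"),
    Incident("INC-3", "search", "stale results"),
    Incident("INC-4", "checkout", "timeout"),
]
candidates = find_problem_candidates(incidents)
print(candidates)  # {('checkout', 'timeout'): ['INC-1', 'INC-2', 'INC-4']}
```

Each returned group is only a candidate for a problem record. Whether the incidents truly share a root cause is what the investigation step has to establish.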
**Proactive**:
1. **Trend analysis**: Identifying patterns in incident data before they become critical
2. **Risk assessment**: Evaluating potential problems before they manifest as incidents
3. **Preventive action**: Implementing improvements to prevent anticipated problems
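As a rough sketch of proactive trend analysis, the example below flags incident categories whose latest weekly count has grown well beyond their own history. The `rising_categories` helper, the category names, and the 1.5x growth factor are illustrative assumptions. A real implementation would likely use a more robust statistical test.

```python
def rising_categories(weekly_counts, min_growth=1.5):
    """Return categories whose latest weekly count is at least
    `min_growth` times the average of the preceding weeks.

    `weekly_counts` maps a category name to a list of weekly incident
    counts, oldest first.
    """
    flagged = []
    for category, counts in weekly_counts.items():
        if len(counts) < 2:
            continue  # not enough history to compare against
        history = counts[:-1]
        baseline = sum(history) / len(history)
        latest = counts[-1]
        if baseline == 0:
            # A category with no prior incidents that now has some is new.
            is_rising = latest > 0
        else:
            is_rising = latest >= min_growth * baseline
        if is_rising:
            flagged.append(category)
    return sorted(flagged)


weekly_counts = {
    "disk-space": [2, 3, 2, 6],     # latest is well above the prior average
    "login-failure": [5, 4, 5, 5],  # roughly stable
    "cert-expiry": [0, 0, 0, 1],    # newly appearing category
}
print(rising_categories(weekly_counts))  # ['cert-expiry', 'disk-space']
```

A flagged category is a prompt for risk assessment, not proof of a problem. Low absolute counts, as in the `cert-expiry` row, can trigger the rule on a single incident.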
**Key concepts**:
- **Known error**: A problem that has been analyzed and has a documented root cause and workaround, but has not yet been permanently fixed
- **Workaround**: A temporary solution that reduces the impact of a problem until a permanent fix is available
- **Problem record**: The documentation of a problem's lifecycle from identification to resolution
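To make these concepts concrete, here is a minimal in-memory sketch of a known error database. The `KnownError` and `KnownErrorDatabase` classes, their fields, and the exact-match symptom lookup are hypothetical simplifications. Real KEDBs are usually part of an ITSM platform and use richer search.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class KnownError:
    # Hypothetical record layout for illustration only.
    problem_id: str
    root_cause: str
    workaround: str
    symptoms: set = field(default_factory=set)
    linked_incidents: list = field(default_factory=list)


class KnownErrorDatabase:
    def __init__(self):
        self._entries = {}

    def add(self, entry):
        self._entries[entry.problem_id] = entry

    def match(self, symptom) -> Optional[KnownError]:
        """Return the first known error that lists this exact symptom."""
        for entry in self._entries.values():
            if symptom in entry.symptoms:
                return entry
        return None

    def link_incident(self, incident_id, symptom) -> Optional[str]:
        """Attach an incident to a matching known error.

        Returns the workaround to apply, or None if no known error matches.
        """
        entry = self.match(symptom)
        if entry is None:
            return None
        entry.linked_incidents.append(incident_id)
        return entry.workaround


kedb = KnownErrorDatabase()
kedb.add(KnownError(
    problem_id="PRB-7",
    root_cause="Connection pool exhausted under load",
    workaround="Restart the pool worker",
    symptoms={"checkout timeout"},
))
workaround = kedb.link_incident("INC-9", "checkout timeout")
print(workaround)  # Restart the pool worker
```

Linking incidents to the known error keeps the one-problem-to-many-incidents relationship visible, which helps when prioritizing permanent fixes.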
**Success metrics**:
- Reduction in the number of recurring incidents
- Mean time to identify root cause
- Number of known errors documented
- Percentage of problems resolved within SLA
- Number of incidents prevented through proactive problem management
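Two of these metrics can be computed directly from problem records, as the sketch below shows. The `ProblemRecord` fields and the choice to measure SLA compliance only over resolved problems are assumptions for illustration. Organizations define both differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ProblemRecord:
    # Hypothetical timestamps; unset values mean the step has not happened yet.
    problem_id: str
    opened_at: datetime
    root_cause_found_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None


def mean_time_to_root_cause(records):
    """Average time from opening a problem to identifying its root cause."""
    durations = [
        r.root_cause_found_at - r.opened_at
        for r in records
        if r.root_cause_found_at is not None
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def percent_resolved_within_sla(records, sla):
    """Share of resolved problems whose resolution time met the SLA."""
    resolved = [r for r in records if r.resolved_at is not None]
    if not resolved:
        return None
    on_time = sum(1 for r in resolved if r.resolved_at - r.opened_at <= sla)
    return 100.0 * on_time / len(resolved)


t0 = datetime(2024, 1, 1)
records = [
    ProblemRecord("PRB-1", t0, t0 + timedelta(days=2), t0 + timedelta(days=10)),
    ProblemRecord("PRB-2", t0, t0 + timedelta(days=4), t0 + timedelta(days=40)),
    ProblemRecord("PRB-3", t0),  # still under investigation
]
print(mean_time_to_root_cause(records))                         # 3 days, 0:00:00
print(percent_resolved_within_sla(records, timedelta(days=30)))  # 50.0
```

Excluding open problems from the SLA percentage can flatter the number when a backlog is growing, so it is worth reporting the count of open problems alongside it.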
**Challenge**: Many organizations skip problem management because it requires dedicated time and resources, and the urgency of incident response tends to crowd it out. However, organizations that invest in it typically see incident volume and operational costs fall over time, because each root cause removed can prevent every future incident it would otherwise have caused.