Fault Tolerance
The ability of a system to continue operating correctly even when some of its components fail.
Also known as: Failover, High availability, Resilient systems
Category: Software Development
Tags: software-design, resilience, systems-design, best-practices, failures
Explanation
Fault tolerance is a design property that enables a system to continue operating properly in the event of the failure of one or more of its components. A fault-tolerant system is designed to handle faults without service interruption or degradation, often through redundancy and automatic failover mechanisms. The goal is to make failures invisible to end users.
Fault tolerance is achieved through several strategies: redundancy (having backup components ready to take over), replication (maintaining multiple copies of data or services), diversity (using different implementations to avoid common-mode failures), and isolation (preventing failures from cascading). Hardware approaches include redundant power supplies, RAID storage, and clustered servers. Software approaches include replicated databases, load balancing, and circuit breakers.
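As a minimal sketch of redundancy with automatic failover, the snippet below tries a list of interchangeable replicas in order and returns the first successful result. The names call_with_failover, AllReplicasFailed, primary, and backup are hypothetical and chosen only for illustration; a real system would also need health checks and narrower error handling.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica has been tried and none succeeded."""


def call_with_failover(replicas: Sequence[Callable[[], T]]) -> T:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # illustrative; real code would catch narrower errors
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors!r}")


def primary() -> str:
    # Hypothetical primary component that is currently failing.
    raise ConnectionError("primary is down")


def backup() -> str:
    # Hypothetical redundant component that takes over.
    return "served by backup"


result = call_with_failover([primary, backup])
print(result)  # served by backup
```

The caller sees only the successful result, which is the sense in which the failure of the primary stays invisible to the end user.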
Key concepts in fault tolerance include Mean Time Between Failures (MTBF), which measures reliability; Mean Time To Recovery (MTTR), which measures repair speed; availability, the fraction of time the system is operational, commonly estimated as MTBF / (MTBF + MTTR) and quoted as a percentage; and the distinction between fail-safe designs (which default to a safe state) and fail-operational designs (which continue to function).
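To make the relationship between these metrics concrete, the sketch below uses the common steady-state estimate availability = MTBF / (MTBF + MTTR). The function names and the sample figures (1000 hours between failures, 1 hour to recover) are assumptions for illustration, and the redundancy formula assumes replicas fail independently, which real deployments often violate.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of N replicas where one working replica is enough.

    Assumes failures are independent, which is an idealization.
    """
    return 1.0 - (1.0 - single) ** replicas


HOURS_PER_YEAR = 24 * 365

a = availability(mtbf_hours=1000, mttr_hours=1)
downtime = (1.0 - a) * HOURS_PER_YEAR

print(f"availability: {a:.4%}")                      # availability: 99.9001%
print(f"expected downtime: {downtime:.2f} h/year")   # expected downtime: 8.75 h/year
print(f"with 2 replicas: {redundant_availability(a, 2):.6%}")
```

Note that availability improves either by failing less often (higher MTBF) or by recovering faster (lower MTTR), which is why fast automatic failover matters as much as reliable components.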
Fault tolerance exists on a spectrum from simple retry logic to sophisticated distributed consensus protocols. The appropriate level depends on the cost of failure versus the cost of tolerance mechanisms. Critical systems like medical devices, aircraft controls, and financial trading platforms require extensive fault tolerance, while consumer applications may tolerate occasional failures.
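At the simple end of that spectrum, retry logic might look like the following sketch, which retries a failing operation with exponential backoff. The retry helper, its parameters, and the flaky operation are hypothetical; the sleep function is injectable only so the example can run without real delays.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation up to `attempts` times, doubling the delay after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


calls = {"count": 0}


def flaky() -> str:
    # Hypothetical operation that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("transient failure")
    return "ok"


print(retry(flaky, sleep=lambda _delay: None))  # ok
```

Retries only help with transient faults and with operations that are safe to repeat; a persistent fault exhausts the attempts and is re-raised, so the cost of this mechanism stays small compared with consensus-based approaches.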
Fault tolerance differs from graceful degradation (which accepts reduced functionality) and fail-fast (which makes failures visible immediately). In practice, well-designed systems combine all three: fault tolerance for common failures, graceful degradation for severe ones, and fail-fast for programming errors.