#24: Resilient, Fault-tolerant, Robust, or Reliable?
Hey there! Do you know the difference between a system that is resilient, fault-tolerant, robust, or reliable? These terms often get used interchangeably, but each one refers to a distinct attribute of system design. Let’s explore the differences between them and why they matter.
Resilient
Definition: The ability to recover after disruption.
Analogy: Think of a rubber band. When stretched and then released, it returns to its original shape. This reflects a system’s capacity to bounce back and recover after experiencing complications or failures.
System design example: Apache Cassandra has mechanisms to recover from node failure. When a node goes down, Cassandra uses a feature called hinted handoff [1] to make sure that when the failed node recovers, it receives any missed data and synchronizes with the rest of the cluster.
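To make the idea concrete, here is a minimal sketch of hinted handoff in Python. It is a toy model, not Cassandra's actual implementation: the `Node` and `Coordinator` classes and the `replay_hints` method are hypothetical names invented for illustration, and real systems also handle hint expiry, storage limits, and consistency levels.

```python
from collections import defaultdict


class Node:
    """A toy replica: an in-memory key-value store that can be up or down."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Coordinator:
    """Writes to every replica and keeps hints for replicas that are down."""

    def __init__(self, replicas):
        self.replicas = replicas
        # node name -> list of (key, value) writes the node missed
        self.hints = defaultdict(list)

    def write(self, key, value):
        for node in self.replicas:
            if node.up:
                node.data[key] = value
            else:
                # The replica is unreachable: remember the write for later.
                self.hints[node.name].append((key, value))

    def replay_hints(self, node):
        """Deliver missed writes, in order, once the node is back up."""
        if not node.up:
            return
        for key, value in self.hints.pop(node.name, []):
            node.data[key] = value


a, b, c = Node("a"), Node("b"), Node("c")
coordinator = Coordinator([a, b, c])

coordinator.write("user:1", "Alice")
c.up = False                          # node c fails
coordinator.write("user:2", "Bob")    # c misses this write
c.up = True                           # node c recovers
coordinator.replay_hints(c)           # c catches up with a and b
```

After the replay, node `c` holds the same data as `a` and `b`: the system bounced back from the disruption, which is the resilience property described above.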
Fault-Tolerant
Definition: The ability to continue operating properly even when one or more components fail.
Analogy: Consider a commercial airplane. If the primary pilot becomes incapacitated, the co-pilot will take over and still manage to safely land the plane. This redundancy ensures that the airplane continues to operate safely despite the failure of one critical component (a pilot).
System design example: A common approach is to run multiple instances of the same service. For example, if one instance of a load balancer fails, the other instances continue to route traffic without interruption. Fault tolerance is commonly achieved through redundancy.
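Here is a minimal sketch of redundancy with failover, assuming each instance can be modeled as a plain Python callable. The names `make_instance`, `call_with_failover`, and `InstanceDown` are hypothetical and only serve the illustration; a real setup would rely on health checks and a proper load-balancing layer.

```python
class InstanceDown(Exception):
    """Raised by an instance that cannot serve the request."""


def make_instance(name, healthy=True):
    """Return a hypothetical service instance as a plain callable."""
    def handle(request):
        if not healthy:
            raise InstanceDown(name)
        return f"{name} handled {request}"
    return handle


def call_with_failover(instances, request):
    """Try each redundant instance in turn; fail only if all of them fail."""
    errors = []
    for instance in instances:
        try:
            return instance(request)
        except InstanceDown as exc:
            errors.append(exc)  # this instance failed, move on to the next
    raise RuntimeError(f"all {len(instances)} instances failed: {errors}")


instances = [
    make_instance("lb-1", healthy=False),  # simulated component failure
    make_instance("lb-2"),
    make_instance("lb-3"),
]
result = call_with_failover(instances, "GET /")
```

The caller never sees the failure of `lb-1`: the request is served by `lb-2`, so the system keeps operating properly despite a failed component.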
Robust
Definition: The ability to function correctly in the presence of stressful conditions.
Analogy: Consider the Golden Gate Bridge. It is built to endure severe conditions, such as heavy traffic or harsh weather, without breaking.
System design example: Based on what we discussed this week, can you think of an example? I’ve added the solution as a footnote to make sure I don’t spoil the answer [2].
Reliable
Definition: The ability to perform as users expect for a specific period of time.
Analogy: Consider a COSC-certified watch [3]. The COSC tests watches over several days to ensure they maintain precise time within specific parameters (-4 to +6 seconds daily). This certification guarantees that the watch will reliably keep time within expected parameters throughout the testing period.
System design example: A cloud datastore offering an SLA, for instance, that the system is available 99.99% of the time:
The system is available = “as users expect”.
99.99% of the time = “for a specific period of time”.
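To get a feel for what 99.99% means in practice, here is a small sketch that converts an availability target into a downtime budget. The function name `allowed_downtime_minutes` and the 30-day and 365-day periods are assumptions chosen for illustration; an actual SLA defines its own measurement window.

```python
def allowed_downtime_minutes(availability_percent, period_days):
    """Maximum downtime, in minutes, that still meets the availability target."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_percent / 100)


per_month = allowed_downtime_minutes(99.99, 30)
per_year = allowed_downtime_minutes(99.99, 365)
print(f"30 days: {per_month:.2f} min, 365 days: {per_year:.2f} min")
# prints "30 days: 4.32 min, 365 days: 52.56 min"
```

In other words, a 99.99% target leaves a budget of roughly 4 minutes of unavailability per 30 days.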
A reliable system is supported by other key characteristics such as resilience, fault tolerance, and robustness.
Understanding the differences between resilience, fault tolerance, robustness, and reliability is key to designing systems that can thrive under pressure. We should make sure that the distinction between these characteristics is clear in our heads.
Tomorrow, you will receive your weekly recap on the theme of the week: reliability fundamentals.
[1] We will cover the topic of hinted handoff in a future issue.
[2] Graceful degradation! A robust system will favor delivering a reduced quality of service instead of a complete system crash or failure.
[3] The COSC (Contrôle Officiel Suisse des Chronomètres, the Official Swiss Chronometer Testing Institute) is an independent Swiss organization that certifies the accuracy and precision of watches. It ensures that a watch meets strict standards for timekeeping performance over several days.
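For readers who want to see the graceful degradation from the robustness footnote in code, here is a minimal sketch. It assumes a hypothetical recommendation endpoint; `get_recommendations`, `overloaded_recommender`, and `popular_items` are invented names, and catching every exception is a simplification that a real service would narrow down.

```python
def get_recommendations(user_id, personalized, fallback):
    """Serve personalized results, or a reduced-quality default if that fails."""
    try:
        return {"items": personalized(user_id), "degraded": False}
    except Exception:
        # Degrade instead of failing the whole request.
        return {"items": fallback(), "degraded": True}


def overloaded_recommender(user_id):
    """Simulates a dependency that cannot cope with the current load."""
    raise TimeoutError("recommender is overloaded")


def popular_items():
    """Cheap, non-personalized default used under stress."""
    return ["item-1", "item-2", "item-3"]


response = get_recommendations(42, overloaded_recommender, popular_items)
```

Under stress, the user still gets a response, just a less tailored one, instead of an error page.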