Fail Open vs. Fail Closed
Hello! Today, let’s explore another reliability concept: fail open vs. fail closed.
When designing reliable systems, one of the most important questions to address is how the system should behave when things go wrong. This is where the concepts of fail open and fail closed come into play:
Fail open: When a failure occurs, the system continues to allow operations.
Fail closed: When a failure occurs, the system stops allowing operations.
Why does this classification matter? It determines what the system needs to prioritize during failures:
Fail open: Prioritize availability over control.
Fail closed: Prioritize control over availability.
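To make the distinction concrete, here is a minimal sketch in Python. The names FailureMode, is_allowed, and broken_check are made up for illustration. It assumes that the check itself can raise an exception when something it depends on is broken.

```python
from enum import Enum
from typing import Callable


class FailureMode(Enum):
    FAIL_OPEN = "fail_open"
    FAIL_CLOSED = "fail_closed"


def is_allowed(check: Callable[[], bool], mode: FailureMode) -> bool:
    """Run a check and fall back to the configured failure mode if it errors."""
    try:
        return check()
    except Exception:
        # The check itself failed, so no real decision is available.
        # Fail open: allow the operation (availability over control).
        # Fail closed: block the operation (control over availability).
        return mode is FailureMode.FAIL_OPEN


def broken_check() -> bool:
    # Hypothetical check whose backend is currently unreachable.
    raise ConnectionError("policy backend unreachable")
```

Note that the failure mode only matters when the check cannot produce an answer. A check that runs successfully and says "no" is a normal rejection in both modes.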
At Google, I work on a system that evaluates disruption intentions and determines whether they should proceed. For example, rebooting a machine is an intention that this service must validate before the actual reboot is allowed.
NOTE: This system is mentioned in VM Live Migration At Scale and referred to as the Safe Removal Service.
This system provides responses based on various dependencies. Now, what happens if one of the critical dependencies is unavailable and the system can’t make a fully informed decision? In this case, the system fails closed by rejecting the request. This approach prioritizes control and security, ensuring the system doesn’t inadvertently authorize a risky action.
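The behavior described above could be sketched roughly as follows. This is not the actual service's code, only an illustration of the idea. Every name here (evaluate_disruption, Decision, DependencyUnavailable, and the "capacity" and "maintenance" dependencies) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict


class DependencyUnavailable(Exception):
    """Raised by a hypothetical dependency client that cannot answer."""


@dataclass(frozen=True)
class Decision:
    approved: bool
    reason: str


def evaluate_disruption(
    intent: str,
    dependencies: Dict[str, Callable[[str], bool]],
) -> Decision:
    """Approve an intent only if every dependency answers and agrees."""
    for name, is_safe in dependencies.items():
        try:
            safe = is_safe(intent)
        except DependencyUnavailable:
            # Fail closed: without this signal the decision is not fully
            # informed, so reject rather than risk approving.
            return Decision(False, f"dependency unavailable: {name}")
        if not safe:
            return Decision(False, f"rejected by: {name}")
    return Decision(True, "all dependencies approved")


def capacity_ok(intent: str) -> bool:
    # Hypothetical dependency that is healthy and approves the intent.
    return True


def maintenance_unreachable(intent: str) -> bool:
    # Hypothetical dependency that is currently down.
    raise DependencyUnavailable("timeout")
```

With these stand-ins, evaluating a reboot intent against both dependencies returns a rejected Decision whose reason names the unavailable dependency, even though the healthy one approved.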
The decision between fail open and fail closed depends on the system itself and the risks involved. For systems where availability is critical (e.g., an e-commerce website), failing open may minimize disruptions at the cost of a temporarily degraded user experience. Conversely, systems requiring security or data integrity (e.g., healthcare or financial systems) often adopt fail closed to avoid harmful consequences.
Regardless of the approach, observability is essential. A fail closed system must provide clear signals explaining why requests were rejected. A fail open system, on the other hand, keeps serving requests while a dependency is broken, so it must not hide underlying issues that could lead to catastrophic failures.
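One way to keep both modes observable is to record every fallback, whichever way it goes. The sketch below uses Python's standard logging module and a plain in-process Counter as a stand-in for a real metrics system. That stand-in is an assumption, and the names guarded_check, fallback_counts, and timing_out are hypothetical.

```python
import logging
from collections import Counter
from typing import Callable

logger = logging.getLogger("failure_policy")

# In-process counter standing in for a real metrics system (assumption).
fallback_counts: Counter = Counter()


def guarded_check(name: str, check: Callable[[], bool], fail_open: bool) -> bool:
    """Run a check; on error, apply the failure policy and record why."""
    try:
        return check()
    except Exception as exc:
        policy = "fail_open" if fail_open else "fail_closed"
        fallback_counts[(name, policy)] += 1
        # Warn in both modes: a rejection needs an explanation, and an
        # allowed-by-default request must not hide the broken dependency.
        logger.warning("check %s errored (%r); applying %s", name, exc, policy)
        return fail_open


def timing_out() -> bool:
    # Hypothetical check whose dependency does not respond.
    raise TimeoutError("dependency did not respond")
```

Alerting on the fail open counter matters as much as alerting on rejections, since from the caller's point of view nothing looks wrong.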
Fail open and fail closed are system design philosophies that balance availability and control, guiding how systems respond to failures based on their priorities and risks. While there’s no one-size-fits-all answer, these concepts are fundamental to keep in mind when designing reliable systems.
How do you approach these decisions in your projects? Please share your thoughts and experiences in the comments.