Safety and Liveness

Balancing Correctness and System Progress

Oct 31, 2024

Happy Friday! First of all, you may have experienced some availability issues yesterday when visiting thecoder.cafe website. If you did, I’m sorry about that. It was a (big) misunderstanding with Substack support, but now everything is fixed. Anyway. Today, we will explore two fundamental properties in system design and databases: safety and liveness.

In the context of concurrent systems and distributed systems, safety and liveness play crucial roles in ensuring that a system behaves correctly and efficiently:

Safety: A safety property guarantees that something bad will not happen.
Liveness: A liveness property ensures that something good will eventually happen.

Consider multiple threads sharing a mutex:

Preventing a deadlock (when threads wait forever) is a safety property—the bad thing (deadlock) must not happen.
Ensuring a thread eventually releases the lock is a liveness property—the good thing (the system progresses) will eventually happen.

Yet, these properties are also important in the context of distributed systems. For example, consider a leader election process in a database:

Ensuring that we avoid having two nodes mistakenly thinking they are the leader is a safety property.
Guaranteeing that the process eventually completes is a liveness property.

A safety property can also be referred to as an invariant, something that must be true no matter what.

There’s one specific type of liveness, which is known as fairness. Fairness refers to ensuring that every part of a system gets a fair chance to do its job over time, so no part is left waiting forever while others keep getting attention. Fairness helps us reason about things like preventing starvation or resource allocation. For example, the process scheduler in Linux is called CFS (Completely Fair Scheduler), as it ensures that all processes are given a fair amount of CPU time, preventing any process from being starved of resources.

In some systems, it’s interesting to note that there’s a tradeoff between safety and liveness:

Safety over liveness: When a system prioritizes safety, it becomes more conservative in decision-making, which can result in halting or slowing down to ensure that no unsafe operations occur. For example, in a consensus algorithm, the system might stall if it can't guarantee a safe decision due to a network partition or failure. Systems like this generally follow a fail-safe approach.
Liveness over safety: When a system prioritizes liveness, its primary focus is on making continuous progress, even if there’s a risk of reaching an unsafe or incorrect state. For example, in a distributed system where nodes must make decisions quickly to avoid delays, they might proceed with incomplete or conflicting information, potentially leading to inconsistencies. Systems like this tend to follow a fail-fast approach.

One last thing to mention: formal specification languages exist to verify that systems meet safety and liveness requirements, such as TLA+1.

Tomorrow, you will receive your weekly recap.

We will delve into TLA+ in a future issue.

Safety and Liveness

Balancing Correctness and System Progress

Comments