Graceful Degradation
Hello! Today, we will discuss a fundamental concept in reliability: graceful degradation.
Complex systems regularly encounter unexpected events, whether it’s a sudden spike in requests or a problem with an external dependency. When these events occur, a service can deliberately reduce its quality of service to avoid a complete failure. This is known as graceful degradation.
Let’s delve into a concrete example from my experience at a previous company. We had a classic service that exposed a REST API over HTTP and connected to a database through a connection pool. To avoid overwhelming the database, each service instance was configured with a pool of 100 connections.
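For illustration, here is a minimal sketch of how such a per-instance cap might be configured with Go’s database/sql package. The Postgres driver, idle-connection count, and connection lifetime are assumptions for the sketch, not details from the original setup.

```go
package storage

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // hypothetical choice of driver; any database/sql driver works
)

// OpenPool opens a connection pool capped at 100 connections,
// mirroring the per-instance pool size described above.
func OpenPool(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	db.SetMaxOpenConns(100)                 // per-instance cap from the example
	db.SetMaxIdleConns(10)                  // assumed value
	db.SetConnMaxLifetime(30 * time.Minute) // assumed value
	return db, nil
}
```

The important point is that the cap is per instance; as the incident below shows, that alone says nothing about the total load on the database.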
Here’s what happened during an unexpected traffic surge:
Unexpected load spike: The service suddenly came under heavy load due to a spike in user requests.
Connection pool limit reached: Each request required a database connection, and since the number of requests exceeded the pool’s capacity, new requests began to queue up.
Autoscaling triggered: Kubernetes autoscaling kicked in and spun up new instances of the service to handle the increased load.
Database overload: Although autoscaling helped absorb more incoming requests, each new service instance also opened its own database connections, pushing the total number of connections beyond what the database could handle (see the sketch after this list). Database responses slowed, and some requests started to time out.
Poor user experience: Eventually, all incoming requests were timing out, resulting in a frustrating experience for users.
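To make the failure mode concrete, here is a back-of-the-envelope sketch. The replica ceiling and the database limit are made-up numbers, not figures from the actual incident; the point is only that total connections grow with every autoscaled replica, so a per-instance cap does not protect the database on its own.

```go
package capacity

// Illustrative numbers only; the real incident's replica count and
// database-side limit were not given.
const (
	poolSizePerInstance = 100 // per-instance connection pool cap
	maxReplicas         = 8   // hypothetical Kubernetes autoscaling ceiling
	dbMaxConnections    = 300 // hypothetical database connection limit
)

// WorstCaseConnections returns the total connections the fleet can open
// when autoscaling reaches its ceiling: 8 * 100 = 800 in this sketch.
func WorstCaseConnections() int {
	return maxReplicas * poolSizePerInstance
}

// Overcommitted reports whether the fleet can exceed the database limit
// (800 > 300 here), which is exactly the overload described above.
func Overcommitted() bool {
	return WorstCaseConnections() > dbMaxConnections
}
```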
In this scenario, one potential solution would have been to deliberately reject or throttle¹ some incoming traffic, a technique known as load shedding. While it may seem unfair to reject some user requests while allowing others to get through, this is the core principle of graceful degradation: giving some response is better than no response. By shedding load early, the service could have avoided overwhelming the database and kept serving a portion of users.
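As a sketch of what load shedding can look like in practice (the concurrency limit, middleware shape, and 503 status are assumptions for illustration, not the approach the team actually used), an HTTP middleware can cap in-flight requests and reject the excess immediately instead of letting everything queue up and time out:

```go
package shed

import "net/http"

// LoadShedder rejects requests beyond maxInFlight with 503 Service
// Unavailable rather than queuing them until they time out.
func LoadShedder(maxInFlight int, next http.Handler) http.Handler {
	// A buffered channel acts as a counting semaphore for in-flight requests.
	slots := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case slots <- struct{}{}:
			// A slot is free: serve the request, then release the slot.
			defer func() { <-slots }()
			next.ServeHTTP(w, r)
		default:
			// Shed load: a fast, explicit rejection keeps capacity
			// available for the requests we do accept.
			http.Error(w, "temporarily overloaded, please retry", http.StatusServiceUnavailable)
		}
	})
}
```

In a setup like the one above, maxInFlight would be sized so that the requests we accept never need more database connections than the pool provides (for example, at most 100 per instance).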
Another example of graceful degradation is a search system that, in the event of an unexpected issue such as a service dependency failure, prioritizes availability over accuracy: it returns fast results instead of no results, even if those results are less accurate.
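Here is a hedged sketch of that idea, assuming a hypothetical rankedSearch dependency and a cachedSearch fallback (neither is from a real system): give the accurate path a short deadline, and fall back to faster, less accurate results when it fails or is too slow.

```go
package search

import (
	"context"
	"time"
)

type Result struct {
	Title string
	URL   string
}

// Search prefers the accurate ranking dependency but degrades gracefully:
// if that dependency is slow or failing, it returns cached, less accurate
// results rather than no results at all.
func Search(ctx context.Context, query string,
	rankedSearch func(context.Context, string) ([]Result, error),
	cachedSearch func(string) []Result,
) ([]Result, error) {
	// Give the accurate path a tight time budget (assumed value).
	ctx, cancel := context.WithTimeout(ctx, 150*time.Millisecond)
	defer cancel()

	results, err := rankedSearch(ctx, query)
	if err != nil {
		// Degrade: availability over accuracy. Any failure of the accurate
		// path (timeout or dependency error) falls back to cached results.
		return cachedSearch(query), nil
	}
	return results, nil
}
```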
Graceful degradation is a critical strategy for building robust services. By planning for graceful degradation, we can improve how our systems handle unexpected events, reducing the risk of complete outages and maintaining service for a portion of users. In exceptional situations, some service is always better than no service.
Tomorrow, we will explore another way to handle graceful degradation.
¹ Reject = direct refusal; throttle = controlled refusal (e.g., rate limiting per user or delaying when a user request is processed).