Happy Monday! Welcome back to The Coder Cafe! This week, we will be exploring the theme of reliability fundamentals. Today, I will explain what reliability means to me.
Reliability is a term that is widely used today. Yet, the more a term is used, the more ambiguous it can become.
In this issue, I’m not going to formally define the concept of reliability, but instead, I’m going to share a story that changed my whole career.
A few years ago, I joined a new company in a safety-critical domain: air traffic management. My first day there is one I will probably remember for the rest of my life. We were doing a training session with all the newcomers, seated in a large conference room and casually waiting for the session to begin. In front of us was the trainer.
After a brief introduction, the trainer asked us to go around the table and explain where we came from. People took turns describing their backgrounds. When it was my turn, I mentioned that I came from the insurance industry. Once everyone had shared their experience, the trainer paused for a moment and then said:
“There’s something important that you should all realize by joining our company: if we have a problem, we may not lose money, we may not lose customers, but we may lose lives.”
That very moment was a revelation for me. It made me understand how crucial reliability can be in some domains. From that day forward, I became captivated by reliability topics.
After this experience, I continued my career, moving through different industries and eventually joining Google as an SRE. Over the years, I came to a realization: even in non-safety-critical industries, reliability remains critical.
Imagine you store your photos in the cloud. You have been storing all your family photos there for years: your child, your partner, your grandparents, whoever. One day, you log in, and you realize that everything is gone. No more albums, no more photos, nothing. How would you feel, knowing that a huge part of your memories might be lost forever?
This scenario reinforced my conviction: in most contexts, reliability is one of the most important features, if not the most important one. A system might have plenty of great and shiny features, but if it’s not reliable, people will quickly lose faith in it.
The goal of this post is not to give you a fancy definition of reliability. For me, the definition is this simple: from the perspective of users and customers [1], it just works.
Systems may be backed by different characteristics—performance, scalability, resilience, durability, fault tolerance—but all of these exist to serve one purpose: ensuring that the system works consistently and dependably. In other words, they all contribute to reliability.
In the end, whether it’s about guiding airplanes safely or safeguarding a picture representing a cherished memory, reliability isn’t just a feature among others; it’s what everything else depends on. That’s how I view reliability.
Tomorrow, we will explore the concept of graceful degradation.
[1] Customers = paying users.
I like your definition. For your information, when an air traffic control system has an outage, the controllers have to clear the sky to maintain safety. They follow a procedure for that: calling each pilot and transferring control to adjacent countries. A big mess. Unfortunately, it happened twice in 2024 in Switzerland…