In this blog post, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability.
We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?
To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.
Availability is the simplest building block of reliability. This metric describes what percentage of the time service is functioning. This is also referred to as the “uptime” of a service. Availability can be monitored by continuously querying the service and confirming responses return with expected speed and accuracy.
A service’s availability is a major component in how a user perceives the reliability. With this in mind, it can be tempting to set a goal for 100% uptime. But SRE teaches us that failure is inevitable; downtime-causing incidents will always occur outside of engineering expectations. Availability is often expressed in “nines,” representing how many decimals places the percentage of uptime can reach. Some major software companies will boast of “five nines,” or 99.999% uptime—but never 100%
Moreover, users will tolerate or even fail to notice downtime in some areas of your service. Development resources devoted to improving availability beyond expectations won’t increase customer happiness. Your service’s maintainability might need these resources instead.
Another major building block of reliability is maintainability. Maintainability factors into availability by describing how downtime originates and is resolved. When an incident causing downtime occurs, maintainable services can be repaired quickly. The sooner the incident is resolved, the sooner the service becomes available again.
There are two major components of maintainability: proactive and reactive.
Let's break these signs down together. Your Product Is Becoming a Utility. Your Users Are Demanding Reliability Over New Features. New Contracts Have Tighter SLAs (B2B) / Customers Are Getting Less Patient (B2C) Spaghetti Code Is Now Easier To Refactor Than To Fix.
We’ll look at three big benefits of investing in reliability and explain how you can get started on your journey to reliability excellence.
In this blog post, we’ll cover what SLI, SLO, and SLA mean and how they contribute to your reliability goals.
In this blog post, we’ll explain the Kubernetes Operator and discuss how it can evolve your SRE solution. Kubernetes is an open-source project that “containerizes” workloads and services and manages deployment and configurations. Released by Google in 2015, Kubernetes is now maintained by the Cloud Native Computing Foundation.
In this blog post, we’ll walk you through holistic measures and best practices for understanding the operational health of your systems.