We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?

To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.

Availability

Availability is the simplest building block of reliability. This metric describes what percentage of the time service is functioning. This is also referred to as the “uptime” of a service. Availability can be monitored by continuously querying the service and confirming responses return with expected speed and accuracy.

A service’s availability is a major component in how a user perceives the reliability. With this in mind, it can be tempting to set a goal for 100% uptime. But SRE teaches us that failure is inevitable; downtime-causing incidents will always occur outside of engineering expectations. Availability is often expressed in “nines,” representing how many decimals places the percentage of uptime can reach. Some major software companies will boast of “five nines,” or 99.999% uptime—but never 100%

Moreover, users will tolerate or even fail to notice downtime in some areas of your service. Development resources devoted to improving availability beyond expectations won’t increase customer happiness. Your service’s maintainability might need these resources instead.

Maintainability

Another major building block of reliability is maintainability. Maintainability factors into availability by describing how downtime originates and is resolved. When an incident causing downtime occurs, maintainable services can be repaired quickly. The sooner the incident is resolved, the sooner the service becomes available again.

There are two major components of maintainability: proactive and reactive.

  • Proactive maintainability involves building a codebase that can be easily understood and changed. As development progresses, issues will arise from incompatibility with existing code. If engineers are writing “spaghetti code” instead of prioritizing maintainability, issues are likely to occur and be difficult to find and solve. Proactive maintenance also includes procedures such as quality assurance and testing.
  • Reactive maintainability describes a service’s ability to be repaired after incidents. This is influenced by a service’s incident response procedures. As incidents are inevitable, great incident response and guardrails are a necessity. If incident response procedures are reliable, teams will resolve incidents quickly. Proper incident responses also foster learning to reduce recurrence. A highly maintainable service allows engineers to implement these lessons effectively

#devops #availability #site reliability engineering #site reliability #site reliability engineer #maintainability #site reliability engineering tools

Availability, Maintainability, Reliability: What's the Difference?
2.40 GEEK