Kole  Haag

Kole Haag

1602925200

Availability, Maintainability, Reliability: What's the Difference?

We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?

To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.

Availability

Availability is the simplest building block of reliability. This metric describes what percentage of the time service is functioning. This is also referred to as the “uptime” of a service. Availability can be monitored by continuously querying the service and confirming responses return with expected speed and accuracy.
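As a rough illustration of the monitoring idea above, the sketch below turns a batch of probe results into an availability percentage. The `ProbeResult` type, the `availability_percent` function, and the 500 ms latency threshold are assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    ok: bool            # True when the response matched what was expected
    latency_ms: float   # observed response time in milliseconds


def availability_percent(results, max_latency_ms=500.0):
    """Share of probes that were both correct and fast enough, as a percentage."""
    if not results:
        raise ValueError("no probe results to evaluate")
    good = sum(1 for r in results if r.ok and r.latency_ms <= max_latency_ms)
    return 100.0 * good / len(results)


# Hypothetical probe data: two good responses, one wrong, one too slow.
probes = [
    ProbeResult(True, 120.0),
    ProbeResult(True, 90.0),
    ProbeResult(False, 80.0),   # wrong answer counts as unavailable
    ProbeResult(True, 900.0),   # too slow counts as unavailable
]
print(availability_percent(probes))  # 50.0
```

Counting slow responses as failures reflects the point that a service which answers too late is, from the user's side, not really available.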

A service’s availability is a major component of how users perceive its reliability. With this in mind, it can be tempting to set a goal of 100% uptime. But SRE teaches us that failure is inevitable; downtime-causing incidents will always occur, no matter how carefully engineers plan. Availability is often expressed in “nines,” meaning the number of consecutive nines in the uptime percentage. Some major software companies will boast of “five nines,” or 99.999% uptime, but never 100%.
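To make the “nines” shorthand concrete, here is a small sketch that converts a number of nines into an uptime percentage and into the downtime that level allows per day. The function names are made up for this example.

```python
def availability_for_nines(nines: int) -> float:
    """Uptime percentage for a given number of nines, e.g. 3 -> about 99.9."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return 100.0 * (1.0 - 10.0 ** -nines)


def downtime_seconds_per_day(nines: int) -> float:
    """Seconds of downtime per day that still meet the given number of nines."""
    seconds_per_day = 24 * 60 * 60
    return seconds_per_day * 10.0 ** -nines


for nines in range(1, 6):
    print(
        nines,
        round(availability_for_nines(nines), 3),
        round(downtime_seconds_per_day(nines), 3),
    )
```

Each extra nine cuts the allowed downtime by a factor of ten, so five nines leaves less than one second per day. That is why every additional nine costs more effort than the last.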

Moreover, users will tolerate or even fail to notice downtime in some areas of your service. Development resources devoted to improving availability beyond expectations won’t increase customer happiness. Your service’s maintainability might need these resources instead.

Maintainability

Another major building block of reliability is maintainability. Maintainability factors into availability by describing how downtime originates and is resolved. When an incident causing downtime occurs, maintainable services can be repaired quickly. The sooner the incident is resolved, the sooner the service becomes available again.
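One common way to put numbers on this relationship is the steady-state formula availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. These terms do not appear in the article. They are brought in here as an assumption for illustration, and the hour values are invented.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a service is up, given mean time between failures
    and mean time to repair (both in hours)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Same failure rate (one incident every 720 hours), different repair speed.
slow_repair = steady_state_availability(720, 4)
fast_repair = steady_state_availability(720, 1)
print(round(100 * slow_repair, 3), round(100 * fast_repair, 3))
```

With the failure rate held constant, shortening the repair time alone raises availability, which is the sense in which maintainability feeds into availability.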

There are two major components of maintainability: proactive and reactive.

  • Proactive maintainability involves building a codebase that can be easily understood and changed. As development progresses, issues will arise from incompatibility with existing code. If engineers are writing “spaghetti code” instead of prioritizing maintainability, issues are likely to occur and be difficult to find and solve. Proactive maintenance also includes procedures such as quality assurance and testing.
  • Reactive maintainability describes a service’s ability to be repaired after incidents. This is influenced by a service’s incident response procedures. As incidents are inevitable, great incident response and guardrails are a necessity. If incident response procedures are reliable, teams will resolve incidents quickly. Proper incident responses also foster learning to reduce recurrence. A highly maintainable service allows engineers to implement these lessons effectively.

#devops #availability #site reliability engineering #site reliability #site reliability engineer #maintainability #site reliability engineering tools


Look Upstream to Solve Your Team's Reliability Issues

Why Upstream?

In “Upstream” by Dan Heath, we explore a variety of different problems ranging from homelessness, to high school graduation rates, to the state of sidewalks in different neighborhoods within the same city. In each of these examples, Dan discusses how upstream thinking decreased downstream work. Upstream thinking is characterized as proactive, collective actions to improve outcomes rather than reactions after an issue has already occurred.

You can also apply this method to software development.

With technology moving at a breakneck pace, it’s difficult to keep up with unplanned work such as incidents and unknown unknowns that come with increasing software complexity and interdependencies. Yet, we can’t halt development. As Dan points out, “Curiosity and innovation and competitiveness push them forward, forward, forward. When it comes to innovation, there’s an accelerator but no brake” (“Upstream”, pg 224).

We can’t impede innovation, but we can apply Dan Heath’s wisdom from upstream thinking to move away from reactive modes of work and make our teams and our systems more reliable.

Barriers to Upstream Thinking

Before we can focus on implementing upstream thinking, we should acknowledge common barriers. Dan notes the problem here: “Organizations are constantly dealing with urgent short-term problems. Planning for speculative future ones is, by definition, not urgent. As a result, it’s hard to convince people to collaborate when hardship hasn’t forced them to” (220).

This might make it feel like everything is a barrier to upstream thinking. But Dan separates these issues into three groups: problem blindness, lack of ownership, and tunneling.

Problem Blindness

Problem blindness is self-explanatory: you are unaware that you have a problem. Issues and daily grievances are brushed off as just the way things are.

Consider alert fatigue. When you’re paged so often that you begin ignoring the alerts, you’re exhibiting problem blindness. Not only are you ignoring potentially important notifications, but you’re desensitized and possibly becoming burned out.

In this situation, you might hear people say things like, “Oh, that’s just the way it is. Our alerts are noisy. You can ignore them,” or “I can’t remember the last time I got a weekend off. You’ll get used to it.” Tony Lykke faced this issue and gave a talk at SREcon America in 2019. His talk, “Fixing On-Call when Nobody Thinks it’s (Too) Broken” describes this apathy.

It’s important to grow wise to problems. If you aren’t aware of them, you can’t begin to fix them. Question the status quo. Are there problems within your organization that have been dismissed or swept under the rug? These are sources of problem blindness. As Dan says, “The escape from problem blindness begins with the shock of awareness that you’ve come to treat the abnormal as normal” (37).

#devops #reliability #site reliability engineering #site reliability #site reliability engineer

What is High Availability? A Tutorial | Liquid Web

High Availability

High availability describes a system designed to be fault-tolerant and highly dependable, to operate continuously without intervention, and to have no single point of failure. These systems are highly sought after to increase the availability and uptime required to keep an infrastructure running without issue. The following characteristics define a High Availability system.

High Availability Clustering

High-availability server clusters (aka HA clusters) are defined as groups of servers that support applications or services that can be utilized reliably with a minimal amount of downtime. These server clusters function using specialized software that utilizes redundancy to achieve mission-critical levels of five 9’s uptime. Currently, approximately 60% of businesses require five 9’s or greater to provide vital services for their businesses.

High availability software capitalizes on the redundant software installed on multiple systems by clustering together a group of servers focused on a common goal in case components fail. Without this form of clustering, if the application or website crashes, the service will not be available until the servers are repaired. HA clustering addresses these situations by detecting faults and quickly restarting or replacing the failed server or service with a new process, without requiring human intervention. This is defined as a “failover” model.
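The failover idea can be sketched in a few lines. This is a simplified illustration rather than how any specific clustering product works. The node names, the `choose_active` function, and the health dictionary are hypothetical.

```python
def choose_active(nodes_in_priority_order, healthy):
    """Return the first healthy node in priority order, or None if all are down."""
    for node in nodes_in_priority_order:
        if healthy.get(node, False):
            return node
    return None


nodes = ["node-a", "node-b"]

# Normal operation: the primary serves traffic.
print(choose_active(nodes, {"node-a": True, "node-b": True}))   # node-a

# The primary fails: traffic moves to the standby without human intervention.
print(choose_active(nodes, {"node-a": False, "node-b": True}))  # node-b
```

A real cluster manager also has to detect the failure, move state, and restart services. The selection step shown here is only the decision at the center of that process.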

The illustration below demonstrates a simple two node high availability cluster.

[Figure: a simple two-node high availability cluster]

High Availability clusters are often used for mission-critical databases, data sharing, applications, and e-commerce websites spread over a network. High Availability implementations build redundancy within a cluster to remove any one single point of failure, including across multiple network connections and data storage, which can be connected redundantly via geographically diverse storage area networks.

High Availability clustered servers usually use a monitoring mechanism called Heartbeat to track each node’s status and health within the cluster over a private network connection. One critical circumstance all clustering software must be able to address is called split-brain, which occurs when all private internal links go down simultaneously but the nodes in the cluster continue to run. If this occurs, every node within the cluster may incorrectly determine that all the other nodes have gone down and attempt to start services that other nodes may still be running. The result is duplicate instances running the same services, which could cause data corruption on the system.
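The sketch below shows heartbeat tracking from the point of view of one node, plus a majority (quorum) check, which is one widely used guard against split-brain. The quorum rule is an assumption added for illustration and is not described in the text above. The `HeartbeatMonitor` class, the node names, and the 3-second timeout are likewise hypothetical.

```python
class HeartbeatMonitor:
    """Tracks when each peer was last heard from, as seen by one node."""

    def __init__(self, peers, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.last_seen = {peer: None for peer in peers}

    def record(self, peer, now):
        """Note that a heartbeat from `peer` arrived at time `now` (seconds)."""
        self.last_seen[peer] = now

    def alive_peers(self, now):
        """Peers whose latest heartbeat is within the timeout window."""
        return [
            peer
            for peer, seen in self.last_seen.items()
            if seen is not None and now - seen <= self.timeout_s
        ]

    def has_quorum(self, now):
        """True when this node plus its visible peers form a strict majority."""
        cluster_size = len(self.last_seen) + 1   # peers plus this node
        visible = len(self.alive_peers(now)) + 1
        return visible > cluster_size // 2


# Three-node cluster as seen from node-a.
monitor = HeartbeatMonitor(["node-b", "node-c"], timeout_s=3.0)
monitor.record("node-b", 10.0)
monitor.record("node-c", 10.0)
print(monitor.has_quorum(11.0))  # True: both peers were heard from recently

# All private links drop, so no heartbeats arrive after t=10.
print(monitor.has_quorum(20.0))  # False: node-a should not take over services
```

Under this rule an isolated node that cannot see a majority refrains from starting services, so two partitions do not both claim ownership of the same data.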

A typical version of high availability software provides attributes that include both hardware and software redundancy. These features include:

  • The automatic detection and discovery of hardware and software components.
  • Autonomous assignment of both active and contingent roles to new elements.
  • Detection of failed software services, hardware components, and other system constructs.
  • Monitoring and notification of redundant components and when they need to be activated.
  • Ability to scale the cluster to accommodate the required changes without external intervention.

Fault tolerance

Fault tolerance is defined as the ability of a system’s infrastructure to foresee and withstand errors and to respond automatically when they occur. The primary quality of these systems is advanced design: safeguards are built in ahead of time and can be called upon should a problem occur. Configuring an infrastructure that anticipates every possible failure is a considerable task that requires the knowledge and experience to counter concerns before they arise. System architects who design such frameworks need both methodologies for mitigating these problems in advance and the ability to implement them.

The following redundancy methodologies are available and should be reviewed during the initial stages of design and implementation.

  • N + 1 Model – This concept refers to the amount of equipment needed (which we will refer to as ‘N’) to keep the entire framework up and running, plus one additional independent backup component that can take over if any one of the ‘N’ components fails.
  • N + 2 Model – Similar to the N + 1 model, but with two backup components, so the system stays protected even if two components fail.
  • 2N Model – This model provides a dedicated redundant backup for each element, doubling the required equipment, to ensure the system’s framework stays fully functional.
  • 2N + 1 Model – Again, this model is similar to the 2N model but with one supplemental component to add a further layer of protection to the system’s framework.

As models progress from N + x to 2N + x, the cost also increases significantly, since every added layer of redundancy means more equipment to buy and maintain. For systems that truly require uptime, however, these models are critical for stability and availability.
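To compare the models side by side, the sketch below counts how many units each one provisions. It assumes the usual reading of these labels, in which N + 1 means one spare beyond the N required units and 2N means a full duplicate set. The function name and the example value N = 4 are made up.

```python
def units_to_provision(n: int, model: str) -> int:
    """Total units (required plus redundant) for a given redundancy model."""
    if n < 1:
        raise ValueError("n must be at least 1")
    totals = {
        "N+1": n + 1,        # one shared spare
        "N+2": n + 2,        # two shared spares
        "2N": 2 * n,         # a full duplicate set
        "2N+1": 2 * n + 1,   # a duplicate set plus one extra spare
    }
    if model not in totals:
        raise ValueError(f"unknown model: {model}")
    return totals[model]


for model in ("N+1", "N+2", "2N", "2N+1"):
    print(model, units_to_provision(4, model))
```

For a system that needs four units, the totals are 5, 6, 8, and 9, which shows how quickly equipment cost grows when moving from the N + x models to the 2N + x models.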

Dependability and Reliability

One of the central tenets of a high availability system is uptime. Uptime is of premier importance, especially if the purpose of a system is to provide an essential service like the 911 systems that respond to emergencies. In business, having a high availability system is required to ensure a vital service remains online. One example would be an ISP or other service that cannot tolerate a loss of function. These systems must be designed with high availability and fault tolerance to ensure reliability and availability while minimizing downtime.

Orchestrated Error Handling

Should an error occur, the system will adapt and compensate for the issue while remaining up and online. Building this type of system requires forethought and planning for the unexpected. Being able to foresee the problems in advance, and planning for their resolution is one of the main qualities of a high availability system.

Scalability

Should the system encounter an issue like a traffic spike or an increase in resource usage, its ability to scale to meet those needs should be automatic and immediate. Building features like these into the system lets it respond quickly to any change in the functionality of the architecture’s processes.
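A minimal sketch of an automatic scaling decision might look like the following. The capacity figure, the replica bounds, and the `replicas_needed` function are assumptions chosen for this example, and real autoscalers weigh more signals than request rate alone.

```python
import math


def replicas_needed(load_rps, capacity_rps_per_replica,
                    min_replicas=2, max_replicas=20):
    """Number of replicas required to serve the load, clamped to fixed bounds."""
    if capacity_rps_per_replica <= 0:
        raise ValueError("capacity_rps_per_replica must be positive")
    wanted = math.ceil(load_rps / capacity_rps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


print(replicas_needed(950, 200))     # 5: a traffic spike needs more replicas
print(replicas_needed(100, 200))     # 2: never drop below the redundancy floor
print(replicas_needed(10_000, 200))  # 20: capped to keep cost bounded
```

The lower bound keeps some redundancy even when traffic is quiet, and the upper bound stops a runaway spike from scaling costs without limit.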

Availability & Five 9’s Uptime

Five 9’s is the industry-standard measure of uptime. This measurement can apply to the system itself, the system processes within a framework, or the program operating inside an infrastructure. It most often applies to the program being delivered to clients in the form of a website or web application. A system’s availability can be measured as the percentage of time that it is available by using this equation: x = (n – y) * 100/n, where “n” is the total number of minutes within a calendar month and “y” is the number of minutes that the service is inaccessible within that month. The table below outlines the downtime allowed at each level of “9’s”.

| Availability % | Downtime/Year |
|---|---|
| 90% (“1 Nine”) | 36.53 days |
| 99% (“2 Nines”) | 3.65 days |
| 99.9% (“3 Nines”) | 8.77 hours |
| 99.99% (“4 Nines”) | 52.60 minutes |
| 99.999% (“5 Nines”) | 5.26 minutes |

As we can see, the higher the number of “9’s”, the more uptime is provided. A high availability system’s goal is to achieve a minimal amount of potential downtime to ensure the system is always available to provide the designated services.
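The equation x = (n – y) * 100/n and the yearly downtime figures can be checked with a short script. The function names are invented for this sketch, and the yearly figures assume a 365.25-day year, which reproduces the numbers in the table.

```python
def monthly_availability(total_minutes: float, down_minutes: float) -> float:
    """x = (n - y) * 100 / n, with n and y in minutes for one calendar month."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - down_minutes) * 100 / total_minutes


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    minutes_per_year = 365.25 * 24 * 60
    return minutes_per_year * (100 - availability_percent) / 100


n = 30 * 24 * 60   # minutes in a 30-day month
y = 43.2           # hypothetical minutes of downtime in that month
print(round(monthly_availability(n, y), 3))          # 99.9

for pct in (90, 99, 99.9, 99.99, 99.999):
    print(pct, round(downtime_minutes_per_year(pct), 2))
```

Converting the printed minutes to days or hours gives the same values as the table, for example about 52,596 minutes (36.53 days) at 90% and about 5.26 minutes at 99.999%.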

Heartbeat

One of the main High Availability components is called Heartbeat. Heartbeat is a daemon that works with cluster management software such as Pacemaker, which is designed specifically for high-availability cluster resource management. Its most important characteristics are:

  • No specific or fixed maximum number of nodes – Heartbeat can be used to build large clusters as well as elementary ones.
  • Resource monitoring: resources can be automatically restarted or moved to another node on failure.
  • A fencing mechanism needed to remove failed nodes from the cluster.
  • A refined policy-based resource management, resource inter-dependencies, and constraints.
  • A time-based rule set to allow for different policies depending on a defined timeframe.
  • A group of resource scripts (for software like Apache, DB2, Oracle, PostgreSQL, etc.) included for more granular management.
  • A GUI for configuring, controlling and monitoring resources and nodes.

Cluster Architecture

**Engineered Availability**

The first segment of a highly available system is a deliberately designed cluster of application servers, engineered in advance to distribute load across the whole cluster and to fail over to a secondary, and possibly a tertiary, system.

The second segment is database scalability. This entails scaling either horizontally or vertically, using multi-master replication and a load balancer to improve the stability and uptime of the database.

#tutorials #2nx models #architecture #autonomous #availability #backups #best practice #clustering #deployment #design #disaster recovery #downtime #engineered #fault tolerance #ha cluster #heartbeat #high availability #infrastructure #monitoring #node #nx models #orchestrated #pacemaker #redundancy #reliability #replication #scalability #single point of failure #split brain #system #testing #uptime

Madelyn  Frami

Madelyn Frami

1599927180

Here are the Important Differences Between SLI, SLO, and SLA

When embarking on your SRE journey, it can seem daunting to decipher all the acronyms. What are SLOs versus SLAs? What’s the difference between SLIs and SLOs? In this blog post, we’ll cover what SLI, SLO, and SLA mean and how they contribute to your reliability goals.

What’s the Difference Between SLI, SLO, and SLA?

Below are the definitions for each of these terms, as well as a brief description. Definitions are according to the Google SRE Handbook.

SLI: “a carefully defined quantitative measure of some aspect of the level of service that is provided.”

SLIs are a quantitative measure, typically provided through your APM platform. Traditionally, these refer to either latency, defined as response time (including queue/wait time) in milliseconds, or availability. A composite SLI is a group of SLIs attributed to a larger SLO. These indicators are points on a digital user journey that contribute to customer experience and satisfaction.

When a developer sets up SLIs to measure their service, they do so in two stages:

  1. SLIs that will directly impact the customer.
  2. SLIs that directly influence the health and the availability or the latency and performance of certain services.
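As an illustration of a latency SLI, the sketch below computes a percentile of response times in milliseconds using the nearest-rank method. The sample values and the `latency_percentile_ms` function are hypothetical, and an APM platform would normally calculate this for you.

```python
def latency_percentile_ms(samples_ms, pct: int) -> float:
    """Nearest-rank percentile of response times; pct is an integer from 1 to 100."""
    if not samples_ms:
        raise ValueError("no latency samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples_ms)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct% of the count
    return ordered[rank - 1]


# Hypothetical response times (including queue/wait time) for ten requests.
samples = [120, 95, 180, 240, 110, 130, 105, 990, 115, 125]
print(latency_percentile_ms(samples, 50))  # 120
print(latency_percentile_ms(samples, 90))  # 240
```

A percentile is often more useful than an average here, because a single slow request (990 ms in this sample) can distort the mean while most users see much faster responses.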

Once you have SLIs set up, you move into your SLOs, which are targets against your SLI.

SLO: “a target value or range of values for a service level that is measured by an SLI. A natural structure for SLOs is thus SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.”
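The structure in this definition, SLI ≤ target or lower bound ≤ SLI ≤ upper bound, translates directly into a check. The `meets_slo` function and the example targets below are assumptions for illustration only.

```python
def meets_slo(sli, upper=None, lower=None) -> bool:
    """True when lower <= sli <= upper, skipping any bound left as None."""
    if upper is None and lower is None:
        raise ValueError("at least one bound is required")
    if lower is not None and sli < lower:
        return False
    if upper is not None and sli > upper:
        return False
    return True


# Latency SLO: measured latency must stay at or below 200 ms.
print(meets_slo(180.0, upper=200.0))   # True
print(meets_slo(240.0, upper=200.0))   # False

# Availability SLO: measured availability must be at least 99.9%.
print(meets_slo(99.95, lower=99.9))    # True
```

Stating each SLO as an explicit bound on a measured SLI gives teams an unambiguous pass or fail signal for prioritizing reliability work.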

Service level objectives become the common language that companies use that allows teams to set guardrails and incentives to drive high levels of service reliability.

Today many companies operate in a constantly reactive mode. They’re reacting to NPS scores, churn, or incidents. This is an expensive, unsustainable use of time and resources, let alone the potentially irrecoverable damage to customer satisfaction and the business. SLOs give you the objective language and measure of how to prioritize reliability work for proactive service health.

SLA: “an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.”

Service level agreements are set by the business rather than engineers, SREs, or ops. When an SLO is missed, SLAs kick in; they define the actions taken when your SLO fails, which often carry financial or contractual consequences.

#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools #service level agreements

Wiley  Mayer

Wiley Mayer

1602946800

4 Signs That Software Reliability Should Be Your Top Priority

You know the companies who break away from the pack. You buy their products with prime shipping, you ride in their cars. You’ve seen them disrupt entire industries. It might seem like giants such as Amazon and Uber have always existed as towering pillars of profit, but that’s not so. What sets companies like these apart is a crucial piece of knowledge. They spotted the tipping point when reliability becomes a top priority to a software company’s success.

Pinpointing this tipping point is hard. After all, many companies can’t afford to stop shipping new features to shore up their software. Timing the transition to reliability well can launch a company ahead of the competition, and win the market (e.g. Amazon, Home Depot). But missing it can spell a company or even an industry’s doom (e.g. Barnes & Noble, Forever 21, and Gymboree in the retail apocalypse). Luckily, there are signs as you approach the tipping point. From examining over 300 companies, we’ve identified four.

Let’s break these signs down together.

1. Your Product Is Becoming a Utility

When a product is a novelty, new adopters have a generous tolerance for errors because they are buying a vision of the future. Once the product becomes a utility, though, companies start to depend on it for critical functions. In the case of Twilio, suicide prevention hotlines are dependent on their reliability. And people start to depend on the product for daily life.

Consider Amazon. This company set the new consumer expectation for e-commerce and disrupted the sales of companies like Barnes & Noble and many more. How? First of all, the store is always open and accessible from your living room couch. Second, Prime delivers all packages with 2-day shipping. Do Amazon’s users care more about drone delivery (a feature), or 2-day shipping (reliability)?

While drone delivery is a neat novelty, the commodity of fast shipping is Amazon’s bread and butter. Users want their order, on-time, no matter what it takes to get there. How else can parents count on Santa coming on Christmas Eve?

While it’s tough to pinpoint the exact moment an industry converts from novelty to utility, we can look at three early indicators according to Harvard Business School.

  1. Companies begin to compete on pricing. Instead of shrugging off a price difference in exchange for the pleasure of a novelty, customers are doing their research before buying.
  2. Companies are restructuring their finances in order to keep the same profit margin even though sales are increasing. They need to innovate to make money with higher costs. They have more employees, more maintenance expenses, more everything.
  3. Companies take a closer look at their customer base. What’s the target market? What customers don’t they want buying their product? They’ve got to make tough choices here to keep loyal customers who appreciate what the company brings to the table.

Thanks to companies like Amazon, Google, Facebook, Netflix, etc., software delivery is transitioning from a novelty to a utility, from something we like to something we need every day. People expect every service to be as responsive and available as these tech giants. As your service loses its novelty, your users will look for reliability over features.

#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools