Alycia Klein


Ansible, Terraform Excel Among Site Reliability Engineers, DevOps

Almost three-quarters of Chef users in the 2020 Stack Overflow survey don’t want to use it next year. Only IBM DB2, VBA and a couple of other technologies are as “dreaded” in the industry. Puppet’s outlook isn’t much better.

Over the last several years, Red Hat’s Ansible and now HashiCorp’s Terraform have risen to become two of the top tools used to deploy infrastructure as code. Ansible and Terraform are each used by a third of the survey participants who describe site reliability engineering as one of their job roles. Many site reliability engineers (SREs) are clamoring to use Terraform in the upcoming year, while the other three tools are set for declines.

#devops #research #red hat’s ansible #hashicorp’s terraform

Kole Haag


Availability, Maintainability, Reliability: What's the Difference?

We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?

To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.


Availability is the simplest building block of reliability. This metric describes the percentage of time that a service is functioning, also referred to as the service’s “uptime.” Availability can be monitored by continuously querying the service and confirming that responses return with the expected speed and accuracy.
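As a rough illustration of that monitoring idea, here is a minimal sketch that turns a batch of probe results into an availability percentage. The `Probe` record, the `availability_percent` function, and the 500 ms latency threshold are all assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Probe:
    """One synthetic check against the service (hypothetical record)."""
    correct: bool        # did the response contain the expected content?
    latency_ms: float    # how long the response took


def availability_percent(probes: Sequence[Probe], max_latency_ms: float = 500.0) -> float:
    """Percentage of probes that were both correct and fast enough.

    The 500 ms default is an arbitrary assumption; pick a threshold that
    matches what your users actually expect.
    """
    if not probes:
        raise ValueError("need at least one probe to compute availability")
    good = sum(1 for p in probes if p.correct and p.latency_ms <= max_latency_ms)
    return 100.0 * good / len(probes)


sample = [
    Probe(correct=True, latency_ms=100),
    Probe(correct=True, latency_ms=900),   # correct but too slow
    Probe(correct=False, latency_ms=50),   # fast but wrong
    Probe(correct=True, latency_ms=200),
]
print(f"availability: {availability_percent(sample):.1f}%")
```

Counting a slow or incorrect response as downtime reflects the point above: a service that answers late or wrongly is not really “up” from the user’s perspective.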

A service’s availability is a major component in how a user perceives its reliability. With this in mind, it can be tempting to set a goal of 100% uptime. But SRE teaches us that failure is inevitable; downtime-causing incidents will always occur outside of engineering expectations. Availability is often expressed in “nines,” representing how many nines appear in the uptime percentage. Some major software companies will boast of “five nines,” or 99.999% uptime, but never 100%.
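To make the “nines” concrete, the arithmetic can be sketched as a downtime budget: the amount of downtime a given number of nines allows over a period. The function name `downtime_budget_minutes` and the 365-day year are assumptions chosen for this illustration.

```python
def downtime_budget_minutes(nines: int, period_days: float = 365) -> float:
    """Minutes of downtime allowed over the period for a given number of nines.

    Three nines means 99.9% uptime, so the allowed downtime fraction is 10**-3.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    downtime_fraction = 10 ** -nines
    return downtime_fraction * period_days * 24 * 60


for n in range(2, 6):
    uptime = 100 * (1 - 10 ** -n)
    print(f"{n} nines ({uptime:.3f}% uptime): "
          f"{downtime_budget_minutes(n):.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a budget of a little over five minutes per year, which shows why each additional nine is so much more expensive to achieve than the last.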

Moreover, users will tolerate or even fail to notice downtime in some areas of your service. Development resources devoted to improving availability beyond expectations won’t increase customer happiness. Your service’s maintainability might need these resources instead.


Another major building block of reliability is maintainability. Maintainability factors into availability by describing how downtime originates and is resolved. When an incident causing downtime occurs, maintainable services can be repaired quickly. The sooner the incident is resolved, the sooner the service becomes available again.

There are two major components of maintainability: proactive and reactive.

  • Proactive maintainability involves building a codebase that can be easily understood and changed. As development progresses, issues will arise from incompatibility with existing code. If engineers are writing “spaghetti code” instead of prioritizing maintainability, issues are likely to occur and be difficult to find and solve. Proactive maintenance also includes procedures such as quality assurance and testing.
  • Reactive maintainability describes a service’s ability to be repaired after incidents. This is influenced by a service’s incident response procedures. As incidents are inevitable, great incident response and guardrails are a necessity. If incident response procedures are reliable, teams will resolve incidents quickly. Proper incident responses also foster learning to reduce recurrence. A highly maintainable service allows engineers to implement these lessons effectively.

#devops #availability #site reliability engineering #site reliability #site reliability engineer #maintainability #site reliability engineering tools

Wiley Mayer


4 Signs That Software Reliability Should Be Your Top Priority

You know the companies that break away from the pack. You buy their products with prime shipping, you ride in their cars. You’ve seen them disrupt entire industries. It might seem like giants such as Amazon and Uber have always existed as towering pillars of profit, but that’s not so. What sets companies like these apart is a crucial piece of knowledge. They spotted the tipping point when reliability becomes the top priority for a software company’s success.

Pinpointing this tipping point is hard. After all, many companies can’t afford to stop shipping new features to shore up their software. Timing the transition to reliability well can launch a company ahead of the competition, and win the market (e.g. Amazon, Home Depot). But missing it can spell a company or even an industry’s doom (e.g. Barnes & Noble, Forever 21, and Gymboree in the retail apocalypse). Luckily, there are signs as you approach the tipping point. From examining over 300 companies, we’ve identified five.

Let’s break these signs down together.

1. Your Product Is Becoming a Utility

When a product is a novelty, new adopters have a generous tolerance for errors because they are buying a vision of the future. Once the product becomes a utility, though, companies start to depend on it for critical functions, and people start to depend on it for daily life. In the case of Twilio, suicide prevention hotlines depend on its reliability.

Consider Amazon. This company set the new consumer expectation for e-commerce and disrupted the sales of companies like Barnes & Noble and many more. How? First of all, the store is always open and accessible from your living room couch. Second, Prime delivers all packages with 2-day shipping. Do Amazon’s users care more about drone delivery (a feature), or 2-day shipping (reliability)?

While drone delivery is a neat novelty, the commodity of fast shipping is Amazon’s bread and butter. Users want their order, on-time, no matter what it takes to get there. How else can parents count on Santa coming on Christmas Eve?

While it’s tough to pinpoint the exact moment an industry converts from novelty to utility, we can look at three early indicators according to Harvard Business School.

  1. Companies begin to compete on price. Instead of shrugging off a price difference in exchange for the pleasure of a novelty, customers are doing their research before buying.
  2. Companies are restructuring their finances in order to keep the same profit margin even though sales are increasing. They need to innovate to make money with higher costs. They have more employees, more maintenance expenses, more everything.
  3. Companies take a closer look at their customer base. What’s the target market? What customers don’t they want buying their product? They’ve got to make tough choices here to keep loyal customers who appreciate what the company brings to the table.

Thanks to companies like Amazon, Google, Facebook, Netflix, etc., software delivery is transitioning from a novelty to a utility, from something we like to something we need every day. People expect every service to be as responsive and available as these tech giants. As your service loses its novelty, your users will look for reliability over features.

#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools

Getting SRE Buy-in From a Manager or Lead for Incident Response

Adopting SRE best practices can be difficult, especially when you need approval from managers, VPs, CTOs, and more. In this blog post, we’ll walk you through crafting a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.

The Situation

As one of the first steps towards SRE adoption, incident management is key. You want to implement an effective incident management system within your team. Now it’s time to convince your lead/manager. How will you accomplish this?

First, we need to recognize that your manager will need a lot of support from engineering and DevOps teams for this transition. These teams will need training in this incident management system to use it each time an incident occurs.

Second, you need to define what you mean by incident management. We’ll define incident management as the process of assembling a response, investigating, resolving, and learning from an incident. This includes incident response playbooks, measuring time-to-detection, monitoring systems, and ticketing workflows.

Once you have a handle on the basic proposal, it’s time to think about what the team (manager included) will gain from an incident management system.

The Incentives

There are four incentives that will motivate your team to adopt incident management best practices:

  • Incident management best practices restore your systems as fast as possible when an incident occurs.
  • A playbook gives everyone a sense of control amidst the chaos. It defines a set of repeatable practices to drive consistency while helping everyone to be thorough with their problem-solving.
  • Measuring time to resolution (TTR) and time to detection (TTD) allows the manager to quantify the team’s improvement over time.
  • Integration with alerting and ticketing systems reduces context switching between different apps. This lowers the stress from mentally keeping track of many systems.
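As a small sketch of how TTD and TTR from the list above might be quantified, the example below averages both over a set of incidents. The `Incident` record and its field names are hypothetical, and it assumes both metrics are measured from the moment the incident started; some teams measure TTR from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Sequence


@dataclass
class Incident:
    """Hypothetical incident record with three timestamps."""
    started_at: datetime    # when the problem actually began
    detected_at: datetime   # when monitoring or a person noticed it
    resolved_at: datetime   # when service was restored


def mean_ttd_minutes(incidents: Sequence[Incident]) -> float:
    """Average time to detection, in minutes, measured from incident start."""
    return mean([(i.detected_at - i.started_at).total_seconds() / 60 for i in incidents])


def mean_ttr_minutes(incidents: Sequence[Incident]) -> float:
    """Average time to resolution, in minutes, measured from incident start (an assumption)."""
    return mean([(i.resolved_at - i.started_at).total_seconds() / 60 for i in incidents])


incidents = [
    Incident(datetime(2020, 6, 1, 10, 0), datetime(2020, 6, 1, 10, 5), datetime(2020, 6, 1, 10, 45)),
    Incident(datetime(2020, 6, 8, 14, 0), datetime(2020, 6, 8, 14, 15), datetime(2020, 6, 8, 15, 15)),
]
print(f"mean TTD: {mean_ttd_minutes(incidents):.1f} min")
print(f"mean TTR: {mean_ttr_minutes(incidents):.1f} min")
```

Tracking these two averages per month or per quarter gives a manager a simple trend line to show whether the new incident management practices are paying off.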

Yet, explaining these incentives to your manager and hoping for immediate support will not guarantee buy-in. You need to anticipate the resistance your manager will have towards this big change.

#devops #incident management #site reliability engineering #site reliability #site reliability engineer #incident response #site reliability engineering tools

Iliana Welch


What Is a Kubernetes Operator and Why it Matters for SRE

Kubernetes is an open-source project that orchestrates containerized workloads and services and manages their deployment and configuration. Google open-sourced it in 2014 and released version 1.0 in 2015; Kubernetes is now maintained by the Cloud Native Computing Foundation. Since its release, it has become a worldwide phenomenon. The majority of cloud-native companies use it, SaaS vendors offer commercial prebuilt versions, and there’s even an annual convention!

What has made Kubernetes such a fundamental service? A major factor is its automation capabilities. Kubernetes can automatically change the configuration of deployed containers, or even deploy new containers, based on metrics it tracks or requests made by engineers. Having Kubernetes handle these processes saves time, eliminates toil, and increases consistency.

If these benefits sound familiar, it might be because they overlap with the philosophies of SRE. But how do you incorporate the automation of Kubernetes into your SRE practices? In this blog post, we’ll explain the Kubernetes Operator—the Kubernetes function at the heart of customized automation—and discuss how it can evolve your SRE solution.

What the Kubernetes Operator Can Do

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

#tutorial #devops #kubernetes #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools #kubernetes operators #kubernetes operator

Wilford Pagac


How to Build Your SRE Team

As you implement SRE practices and culture at your organization, you’ll realize everyone has a part to play. From engineers setting SLOs to management upholding the virtue of blamelessness to marketing teams conducting retrospectives on email campaigns, there’s no part of an organization that doesn’t benefit from the SRE mentality.

However, while it’s not necessary to have people with the title of ‘SRE’ to successfully adopt the best practices of SRE, having people who are dedicated to stewardship of SRE practices is important to achieve reliability excellence. In this blog post, we’ll look at some of the many roles an SRE can play, and how to find people with those skill sets.

#devops #teams #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools