Duane Purdy

1626102960

What is Site Reliability Engineering (SRE)?

Learn more about SRE → http://ibm.biz/guide-to-sre
Learn more about DevOps → http://ibm.biz/guide-to-devops
Watch “DevOps vs. SRE” lightboard video → https://youtu.be/KCzNd3StIoU
Earn a badge with FREE interactive Kubernetes labs → http://ibm.biz/learn-k8s-browser-based-labs

Check out IBM Cloud Pak for Watson AIOps → http://ibm.biz/watson-aiops-cloud-pak

Software development keeps getting faster and more complex, which makes it difficult for IT operations teams to keep up. DevOps emerged to give teams a set of practices for collaborating more closely and shortening the development lifecycle while providing continuous delivery.

However, for all the improvements DevOps has brought to organizations, many still lacked a dedicated person focused on developing software systems that improve site reliability and performance.

In this lightboard video, Bradley Knapp of IBM Cloud breaks down how a Site Reliability Engineer, or SRE, takes on this role, helping organizations better manage systems, solve problems, and automate operations tasks.

Get started on IBM Cloud at no cost → http://ibm.biz/free-tier-acct
Subscribe to see more videos like this in the future → http://youtube.com/user/IBMCloud?sub_confirmation=1

#SiteReliabilityEngineer #SRE #DevOps


Wilford Pagac

1599937200

How to Build Your SRE Team

As you implement SRE practices and culture at your organization, you’ll realize everyone has a part to play. From engineers setting SLOs to management upholding the virtue of blamelessness to marketing teams conducting retrospectives on email campaigns, there’s no part of an organization that doesn’t benefit from the SRE mentality.

However, while it’s not necessary to have people with the title of ‘SRE’ to successfully adopt the best practices of SRE, having people who are dedicated to stewardship of SRE practices is important to achieve reliability excellence. In this blog post, we’ll look at some of the many roles an SRE can play, and how to find people with those skill sets.

#devops #teams #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools

Iliana Welch

1598403960

What Is a Kubernetes Operator and Why it Matters for SRE

Kubernetes is an open-source project that orchestrates “containerized” workloads and services and manages their deployment and configuration. Google announced it in 2014 and version 1.0 arrived in 2015; Kubernetes is now maintained by the Cloud Native Computing Foundation. Since its release, it has become a worldwide phenomenon. The majority of cloud-native companies use it, SaaS vendors offer commercial prebuilt versions, and there’s even an annual convention!

What has made Kubernetes become such a fundamental service? A major factor is its automation capabilities. Kubernetes can automatically make changes to the configuration of deployed containers or even deploy new containers based on metrics it tracks or requests made by engineers. Having Kubernetes handle these processes saves time, eliminates toil, and increases consistency.

If these benefits sound familiar, it might be because they overlap with the philosophies of SRE. But how do you incorporate the automation of Kubernetes into your SRE practices? In this blog post, we’ll explain the Kubernetes Operator—the Kubernetes function at the heart of customized automation—and discuss how it can evolve your SRE solution.

What the Kubernetes Operator Can Do

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?
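
One way to picture the “automated Site Reliability Engineer” idea is as a control loop that compares the state you want with the state you have and works out the difference. The sketch below is plain Python rather than a real Kubernetes client; `AppState` and `reconcile` are hypothetical names chosen for illustration, and a real Operator would watch custom resources through the Kubernetes API instead of receiving two objects directly.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical application state: how many replicas are running."""
    replicas: int


def reconcile(desired: AppState, observed: AppState) -> list[str]:
    """Return the actions needed to move the observed state to the desired state."""
    diff = desired.replicas - observed.replicas
    if diff > 0:
        # Too few replicas: start the missing ones.
        return [f"start replica {observed.replicas + i + 1}" for i in range(diff)]
    if diff < 0:
        # Too many replicas: stop the surplus, highest-numbered first.
        return [f"stop replica {observed.replicas - i}" for i in range(-diff)]
    return []  # already converged, nothing to do


if __name__ == "__main__":
    print(reconcile(AppState(replicas=3), AppState(replicas=1)))
```

Running the loop repeatedly, rather than once, is what lets this style of automation correct drift without an engineer stepping in.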

#tutorial #devops #kubernetes #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools #kubernetes operators #kubernetes operator

Wiley Mayer

1602954000

Guide For Implementing SRE In NOCs

Network Operations Centers, or NOCs, serve as hubs for monitoring and incident response. A NOC is usually a physical location in an organization, where operators sit at a central desk with screens showing current service data. But the functionality of a NOC can also be distributed. Some organizations build virtual NOCs that are staffed fully remotely, which allows for distributed teams and follow-the-sun rotations. NOC as a service is another structure gaining popularity: the NOC is outsourced to a third party that offers it as a service, similar to other infrastructure tools.

As IT services become more fragmented, shifting to virtual NOCs becomes more popular. These structures are far removed from the traditional big desk model, but their functions are the same. Any system where operators are able to monitor for incidents and respond to them can serve as a NOC.

The goals of NOC operators and SREs are aligned. Both try to improve the reliability of the system. In fact, SRE best practices applied to the NOC structure can take reliability to a new level. In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Monitor Smarter by Focusing On Complex Metrics

The traditional image of a NOC is a huge grid of monitors showing every detail of the service’s data. A team of operators watches like hawks, catching any warning signs of incidents and responding. This system has several advantages. The completeness of the data displayed ensures nothing is missed. Also, having eyes on glass at all times promotes timely responses.

The SRE perspective on monitoring is different. The system monitors and alerts on metrics that have customer impact. These metrics are Service Level Indicators, or SLIs. Instead of relying on human observers, monitoring tools send alerts when these metrics hit thresholds. After iteration, these systems can be more reliable than a human observer. Yet this doesn’t mean incidents won’t slip through the cracks. SRE teaches us that failure in any system is inevitable. Especially for organizations with multiple operating models, a mix of legacy and modern technologies, and the need to ensure governance and control, human observers in a NOC may remain an essential extra layer of monitoring.
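
As a rough illustration of threshold-based alerting on SLIs, the sketch below treats each SLI as a ratio of good events to total events and flags the ones that fall below a target. The `SLI` class, the metric names, and the threshold values are all assumptions made for the example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SLI:
    """Hypothetical Service Level Indicator measured as good events / total events."""
    name: str
    good_events: int
    total_events: int
    threshold: float  # minimum acceptable ratio of good events (assumed target)

    def value(self) -> float:
        # Treat "no traffic" as healthy so an idle service does not page anyone.
        if self.total_events == 0:
            return 1.0
        return self.good_events / self.total_events


def breached(slis: list[SLI]) -> list[str]:
    """Return the names of the SLIs whose current value is below the threshold."""
    return [s.name for s in slis if s.value() < s.threshold]


if __name__ == "__main__":
    slis = [
        SLI("checkout_success", 9_950, 10_000, threshold=0.999),
        SLI("search_under_300ms", 9_800, 10_000, threshold=0.95),
    ]
    print(breached(slis))  # ['checkout_success']
```

Only the checkout SLI alerts here: 99.5% success is below its 99.9% target, while search at 98% clears its 95% target.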

To achieve the best of both worlds of your NOC and SRE practices, you’ll need to understand what response each of your metrics requires. For simple metrics that you can pull directly from system data, automated responses can save toil for your NOC operators. More nuanced metrics where an expert’s judgment may be necessary can be discussed in the NOC. This allows operators to focus on where their expertise is necessary. Monitoring tools handle the rest.
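
A simple way to record which response each metric requires is a lookup table that separates metrics safe to automate from those that need an operator’s judgment. The metric names and the `Response` categories below are hypothetical; the one design choice worth noting is that unknown metrics default to human review.

```python
from enum import Enum


class Response(Enum):
    AUTOMATED = "automated"    # a monitoring tool or runbook script acts on its own
    NOC_REVIEW = "noc_review"  # a NOC operator applies judgment


# Hypothetical catalog mapping each metric to the response it requires.
METRIC_RESPONSES = {
    "disk_usage_percent": Response.AUTOMATED,
    "http_5xx_rate": Response.AUTOMATED,
    "checkout_conversion_drop": Response.NOC_REVIEW,
}


def route(metric: str) -> Response:
    """Look up the response for a metric; unclassified metrics go to a human."""
    return METRIC_RESPONSES.get(metric, Response.NOC_REVIEW)


if __name__ == "__main__":
    for name in ("disk_usage_percent", "checkout_conversion_drop", "new_metric"):
        print(name, "->", route(name).value)
```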

Escalate and Triage With Classification and On-Call

When a NOC operator notices an incident, their typical mode of operation is to first triage and try to remediate the issue via runbooks and existing documentation. They determine the severity and service area of the incident. Based on this, they escalate and engage the correct people for the incident response. In a traditional NOC structure, there’s a dedicated on-call team for incident response.

In the SRE world, things become less siloed. Incident classification applies across the organization. The developers most closely involved with each service area are also responsible for on-call shifts, rather than laying that responsibility squarely on dedicated on-call teams. NOC operators can collaborate with engineers on developing fair and effective on-call schedules. Yet NOC procedures for alerting don’t need to change. All of the infrastructure set up to alert and escalate will still apply. SRE only increases the range and effectiveness of these alerts by involving more experts. As service complexity grows, ensuring that a wide variety of experts can respond to incidents is essential.
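
The triage-then-escalate flow described above can be sketched as a function that takes an incident’s severity and service area and returns who to engage. Everything specific here is an assumption for illustration: the three-level severity scale, the `ON_CALL` rotation, and the responder names.

```python
from dataclasses import dataclass

# Hypothetical on-call rotation: developers who own each service area.
ON_CALL = {
    "payments": ["dev-alice", "dev-bob"],
    "search": ["dev-carol"],
}
NOC_DESK = "noc-operator"


@dataclass
class Incident:
    service_area: str
    severity: int  # assumed scale: 1 = most severe, 3 = least severe


def responders(incident: Incident) -> list[str]:
    """Decide who to engage based on severity and service area."""
    owners = ON_CALL.get(incident.service_area, [])
    if incident.severity >= 3 or not owners:
        # Low severity, or no owning team on record: the NOC works the runbook.
        return [NOC_DESK]
    if incident.severity == 2:
        # Medium severity: the NOC plus the primary on-call developer.
        return [NOC_DESK, owners[0]]
    # Highest severity: the NOC plus everyone on call for the service area.
    return [NOC_DESK, *owners]


if __name__ == "__main__":
    print(responders(Incident("payments", severity=1)))
    print(responders(Incident("search", severity=3)))
```

The NOC stays in every response path, which mirrors the point that existing alerting procedures keep working while more experts are brought in.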

#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools #noc as a service #network operations center

Kole Haag

1602925200

Availability, Maintainability, Reliability: What's the Difference?

We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?

To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.

Availability

Availability is the simplest building block of reliability. This metric describes what percentage of the time service is functioning. This is also referred to as the “uptime” of a service. Availability can be monitored by continuously querying the service and confirming responses return with expected speed and accuracy.
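
As a minimal sketch of this measurement, assume each probe of the service is recorded as `True` when the response came back with the expected speed and accuracy and `False` otherwise. Availability is then the share of successful probes. The function name and the sample data are illustrative only.

```python
def availability(probe_results: list[bool]) -> float:
    """Return availability as a percentage of successful probes."""
    if not probe_results:
        raise ValueError("need at least one probe result")
    # True counts as 1 and False as 0, so sum() gives the number of successes.
    return 100.0 * sum(probe_results) / len(probe_results)


if __name__ == "__main__":
    # Hypothetical sample: 1,000 probes, of which 2 failed.
    probes = [True] * 998 + [False] * 2
    print(f"{availability(probes):.1f}%")  # 99.8%
```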

A service’s availability is a major component in how a user perceives its reliability. With this in mind, it can be tempting to set a goal of 100% uptime. But SRE teaches us that failure is inevitable; downtime-causing incidents will always occur outside of engineering expectations. Availability is often expressed in “nines,” counting how many nines appear in the uptime percentage. Some major software companies will boast of “five nines,” or 99.999% uptime, but never 100%.
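
To make the “nines” concrete, the following sketch converts a number of nines into the downtime it permits per year. It assumes a 365-day year and treats all downtime as equal; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime, in minutes, permitted by a given number of nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines (99.9%) leaves 0.001 unavailable
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes per year")
```

Five nines leaves a little over five minutes of downtime per year, which shows how steeply the cost of each additional nine rises.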

Moreover, users will tolerate or even fail to notice downtime in some areas of your service. Development resources devoted to improving availability beyond expectations won’t increase customer happiness. Your service’s maintainability might need these resources instead.

Maintainability

Another major building block of reliability is maintainability. Maintainability factors into availability by describing how downtime originates and is resolved. When an incident causing downtime occurs, maintainable services can be repaired quickly. The sooner the incident is resolved, the sooner the service becomes available again.

There are two major components of maintainability: proactive and reactive.

  • Proactive maintainability involves building a codebase that can be easily understood and changed. As development progresses, issues will arise from incompatibility with existing code. If engineers are writing “spaghetti code” instead of prioritizing maintainability, issues are likely to occur and be difficult to find and solve. Proactive maintenance also includes procedures such as quality assurance and testing.
  • Reactive maintainability describes a service’s ability to be repaired after incidents. This is influenced by a service’s incident response procedures. As incidents are inevitable, great incident response and guardrails are a necessity. If incident response procedures are reliable, teams will resolve incidents quickly. Proper incident responses also foster learning to reduce recurrence. A highly maintainable service allows engineers to implement these lessons effectively.
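
The link between maintainability and availability can be sketched with the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The terms mean time between failures (MTBF) and mean time to repair (MTTR) are not used in the text above; they are introduced here as assumptions to stand in for “how often incidents occur” and “how quickly they are resolved.” The numbers are made up.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction, given mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Same failure rate (one incident every 720 hours), different repair speeds.
    slow_repair = steady_state_availability(mtbf_hours=720, mttr_hours=4)
    fast_repair = steady_state_availability(mtbf_hours=720, mttr_hours=1)
    print(f"4-hour repairs: {slow_repair:.4%}")
    print(f"1-hour repairs: {fast_repair:.4%}")
```

With the failure rate held constant, shortening repairs alone raises availability, which is the sense in which maintainability factors into availability.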

#devops #availability #site reliability engineering #site reliability #site reliability engineer #maintainability #site reliability engineering tools

Wiley Mayer

1602946800

4 Signs That Software Reliability Should Be Your Top Priority

You know the companies that break away from the pack. You buy their products with Prime shipping, you ride in their cars. You’ve seen them disrupt entire industries. It might seem like giants such as Amazon and Uber have always existed as towering pillars of profit, but that’s not so. What sets companies like these apart is a crucial piece of knowledge. They spotted the tipping point when reliability becomes a top priority for a software company’s success.

Pinpointing this tipping point is hard. After all, many companies can’t afford to stop shipping new features to shore up their software. Timing the transition to reliability well can launch a company ahead of the competition, and win the market (e.g. Amazon, Home Depot). But missing it can spell a company or even an industry’s doom (e.g. Barnes & Noble, Forever 21, and Gymboree in the retail apocalypse). Luckily, there are signs as you approach the tipping point. From examining over 300 companies, we’ve identified four.

Let’s break these signs down together.

1. Your Product Is Becoming a Utility

When a product is a novelty, new adopters have a generous tolerance for errors because they are buying a vision of the future. Once the product becomes a utility, though, companies start to depend on it for critical functions, and people start to depend on it for daily life. In the case of Twilio, suicide prevention hotlines depend on its reliability.

Consider Amazon. This company set the new consumer expectation for e-commerce and disrupted the sales of companies like Barnes & Noble and many more. How? First of all, the store is always open and accessible from your living room couch. Second, Prime delivers all packages with 2-day shipping. Do Amazon’s users care more about drone delivery (a feature), or 2-day shipping (reliability)?

While drone delivery is a neat novelty, the commodity of fast shipping is Amazon’s bread and butter. Users want their order, on-time, no matter what it takes to get there. How else can parents count on Santa coming on Christmas Eve?

While it’s tough to pinpoint the exact moment an industry converts from novelty to utility, we can look at three early indicators according to Harvard Business School.

  1. Companies begin to compete on price. Instead of shrugging off a price difference in exchange for the pleasure of a novelty, customers are doing their research before buying.
  2. Companies are restructuring their finances in order to keep the same profit margin even though sales are increasing. They need to innovate to make money with higher costs. They have more employees, more maintenance expenses, more everything.
  3. Companies take a closer look at their customer base. What’s the target market? What customers don’t they want buying their product? They’ve got to make tough choices here to keep loyal customers who appreciate what the company brings to the table.

Thanks to companies like Amazon, Google, Facebook, Netflix, etc., software delivery is transitioning from a novelty to a utility, from something we like to something we need every day. People expect every service to be as responsive and available as these tech giants. As your service loses its novelty, your users will look for reliability over features.

#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools