Resiliency isn’t something that just happens; it’s a result of dedication and hard work. To reach your optimal state of resilience, adopt a few crucial SRE best practices that strengthen your processes.

Increase Cognitive Capacity With Runbooks

As you know, failure is not an option… because actually, it’s inevitable. Things will go wrong, especially as systems grow more complex and rely more heavily on third-party service providers. You’ll need to be prepared to make the right decisions fast. There’s nothing worse than being called in the wee hours of a Sunday morning to handle a situation where thousands of dollars are going down the drain every second. Your brain is foggy, and you’ll likely need time to adjust to the extreme pressure of a critical incident. In these cases (and really, in any incident), runbooks can guide you through the process and help you make the most of your time.

According to Chris Taylor at Taksati Consulting, good incident runbooks help you cover all your bases. They typically include flowcharts and checklists to depict both the big picture and the minute details, a RACI (responsible, accountable, consulted, informed) chart for each step, and a list of environmental influences that are unique to your system. To create your incident playbook, Chris recommends aggregating the following information:

  • An inventory of relevant tools.
  • The right personnel/subject matter experts to engage in the response.
  • The problem to solve, or the workflow you’re trying to document.
  • The current state (whether this is a new process or an update to an old one).

By developing incident runbooks and practicing running through them, you’ll be more prepared for the inevitable.
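As a rough illustration, a runbook's checklist and per-step RACI assignments could be kept as structured data so that a responder can always ask "what is the next open step, and who owns it?". This is only a sketch: the class names, fields, and the example service and roles are hypothetical, not a format prescribed by the source.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One checklist item with its RACI assignments (all names hypothetical)."""
    description: str
    responsible: str
    accountable: str
    consulted: List[str] = field(default_factory=list)
    informed: List[str] = field(default_factory=list)
    done: bool = False


@dataclass
class Runbook:
    """A runbook: the problem it covers, the tools needed, and ordered steps."""
    problem: str
    tools: List[str]
    steps: List[Step]

    def next_step(self) -> Optional[Step]:
        # Return the first step that has not been completed, or None if all are done.
        for step in self.steps:
            if not step.done:
                return step
        return None


# Hypothetical example content.
runbook = Runbook(
    problem="Elevated error rate on the checkout service",
    tools=["dashboard", "log search", "deploy tool"],
    steps=[
        Step("Confirm the alert on the dashboard", "on-call engineer", "incident commander"),
        Step("Roll back the latest deploy", "on-call engineer", "incident commander",
             consulted=["service owner"], informed=["support team"]),
    ],
)

runbook.steps[0].done = True
current = runbook.next_step()
print(current.description, "| responsible:", current.responsible)
```

Keeping the owner next to each step means nobody has to work out who does what in the middle of an incident.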

Set SLOs to Guide Change Management

Change management is often done haphazardly, if at all. As a result, organizations cannot manage the risk of pushing new code, which can lead to more incidents. Rather than employ ITIL’s arduous change advisory board (CAB) method, SRE seeks to empower teams to push code on their own schedule while still managing risk. To do this, SRE uses SLOs and error budgets.

SLOs, or service level objectives, are internal goals for service availability and speed that are set according to customer needs. These SLOs serve as a benchmark for safety. Each month, your SLO determines a certain allowable amount of downtime, known as your error budget. For example, a 99.9% availability SLO over a 30-day month allows roughly 43 minutes of downtime. You can spend this budget on pushing new features.

If a feature is at risk of exceeding your error budget, it cannot be pushed until the next window. If the feature poses little to no risk to your SLO, you can push it. Each month, teams should aspire to use the entirety of their error budgets without exceeding them. This way, your organization can optimize for innovation, but do so safely without risking unacceptable levels of customer impact.
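The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability SLO measured in minutes over a 30-day window, and assumes you can estimate how many minutes of downtime a release puts at risk; the class, method names, and numbers are illustrative rather than any standard API.

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass
class ErrorBudget:
    slo_target: float      # e.g. 0.999 for a 99.9% availability SLO
    window_days: int = 30  # assumed measurement window

    @property
    def total_minutes(self) -> float:
        # Allowed downtime is the fraction of the window the SLO does not cover.
        return (1.0 - self.slo_target) * self.window_days * MINUTES_PER_DAY

    def remaining_minutes(self, downtime_so_far: float) -> float:
        return self.total_minutes - downtime_so_far

    def can_ship(self, downtime_so_far: float, estimated_risk_minutes: float) -> bool:
        # A release is allowed only if its estimated risk fits in what is left.
        return estimated_risk_minutes <= self.remaining_minutes(downtime_so_far)


budget = ErrorBudget(slo_target=0.999)
print(f"Monthly budget: {budget.total_minutes:.1f} minutes")
print("Ship low-risk change:", budget.can_ship(downtime_so_far=30, estimated_risk_minutes=10))
print("Ship high-risk change:", budget.can_ship(downtime_so_far=30, estimated_risk_minutes=20))
```

In practice the hard part is the risk estimate, not the subtraction. The value of the calculation is that it turns "is this release too risky?" into a question with a shared, numeric answer.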

Plan Ahead for Fluctuations in Capacity

Black Friday outages, scaling, moving to the cloud: all of these big events require heightened capacity planning. If you don’t have enough load balancers on Black Friday or Cyber Monday, you might be sunk. Or, if your company is simply growing quickly, you’ll need to adopt best practices to make sure that your team has everything it needs to be successful. Two types of demand require additional capacity: organic demand (your organization’s natural growth) and inorganic demand (growth that happens due to a marketing campaign or holiday). To prepare for these events, you’ll need to forecast the demand and allow time to acquire the capacity.
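A first-pass forecast can combine both kinds of demand: compound the organic growth rate forward, then apply a multiplier for the inorganic event. The sketch below assumes you know your current peak requests per second, a monthly growth rate, and a per-instance throughput measured by load testing; every name and number is a placeholder, not a recommendation.

```python
import math


def forecast_peak_rps(current_peak_rps: float,
                      monthly_growth_rate: float,
                      months_ahead: int,
                      event_multiplier: float = 1.0) -> float:
    """Organic growth compounds monthly; an inorganic event scales the result."""
    organic = current_peak_rps * (1 + monthly_growth_rate) ** months_ahead
    return organic * event_multiplier


def instances_needed(peak_rps: float,
                     rps_per_instance: float,
                     headroom: float = 0.3) -> int:
    """Instances required if each one is only loaded to (1 - headroom) of its limit."""
    usable_rps = rps_per_instance * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


# Hypothetical inputs: 1,000 rps today, 5% monthly growth, a holiday six months
# out that triples traffic, and instances that handle 200 rps in load tests.
holiday_peak = forecast_peak_rps(1000, 0.05, 6, event_multiplier=3.0)
print("Forecast peak rps:", round(holiday_peak))
print("Instances to provision:", instances_needed(holiday_peak, 200))
```

The headroom parameter reflects the point about boundary conditions: you provision so that instances stay well short of the limits where behavior degrades sharply.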

Important facets of capacity planning include regular load testing and accurate provisioning. Regular load testing allows you to see how your system is operating under the average strain of daily users. As Google SRE Stephen Thorne writes, “It’s important to know that when you reach boundary conditions (such as CPU starvation or memory limits) things can go catastrophic, so sometimes it’s important to know where those limits are.” If your service is struggling to load balance, or the CPU usage is through the roof, you know that you’ll need to add capacity in the event of increased demand. That’s where provisioning comes in.

Adding capacity in any form can be expensive, so knowing where you need additional resources is key. It’s important to routinely plan for inorganic demand so you have time to provision correctly. Adding capacity can be a lengthy effort, especially if it involves moving to the cloud. You’ll also need to know how many hands you’ll need on deck for these momentous occasions.

Resiliency doesn’t just exist in your processes; it also exists in your people. Capacity planning is an important part of a resilient system because your team members are resources that must be allocated, too. They need time off for holidays, personal vacations, and the obligatory annual cold. If you fail to plan for time off, you won’t have enough hands on deck to handle incidents as they occur. Denying people time off is obviously not the answer, as that will only lead to burnout and churn. So it’s important to develop a capacity plan that can accommodate people being, well, people.

Johann Strasser shares four steps you can take to develop a capacity plan that will eliminate staffing insecurity:

  1. Establish all necessary processes with the appropriate staff – from top management to team leaders. Decide how often you will need to revise/revisit this process and make sure that everyone is in agreement on this.
  2. Provide for complete and up-to-date project data and prioritize your projects. What projects are the most important, and which can be put on the back burner for now? Additionally, how long will each project take? You’ll need this data to be able to move forward with accurate plans.
  3. Identify the capacities across your existing team, as well as your infrastructure and services. Is the team equipped, and is the system architected, in a way that minimizes performance regressions and protects efficiency and capacity?
  4. Consolidate the requirements (step 2) and the capacities (step 3). Identify underload as well as overload and try to balance them.
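Step 4 amounts to comparing required hours against available hours for each planning period. Here is a minimal sketch, assuming weekly periods, hours as the unit, and a 10% tolerance band around full utilization; the names and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WeekPlan:
    week: str
    required_hours: float   # from prioritized project data (step 2)
    available_hours: float  # team capacity minus planned time off (step 3)


def classify(plan: WeekPlan, tolerance: float = 0.1) -> str:
    """Label a week as overload, underload, or balanced by its utilization."""
    if plan.available_hours <= 0:
        return "overload"  # no one is available, so any demand is too much
    utilization = plan.required_hours / plan.available_hours
    if utilization > 1 + tolerance:
        return "overload"
    if utilization < 1 - tolerance:
        return "underload"
    return "balanced"


# Hypothetical plan: week 52 has less capacity because of holiday time off.
plans = [
    WeekPlan("week 50", required_hours=160, available_hours=160),
    WeekPlan("week 51", required_hours=100, available_hours=160),
    WeekPlan("week 52", required_hours=120, available_hours=80),
]

for plan in plans:
    print(plan.week, "->", classify(plan))
```

Once overloaded and underloaded weeks are visible side by side, balancing usually means moving lower-priority work from the former into the latter.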

So, now you’ve got the people and the process, but how can you learn and improve on your resilience? For that, you’ll need great retrospective practices in place that facilitate real introspection, psychological safety, and forward-looking accountability.

Reduce Engineering Problems With a Resiliency Mindset - DZone DevOps