On-call: you may see it as a necessary evil. When fast incident response can make or break your reputation, designating people across the team to be ready to react at all hours of the day is a necessity. But this often creates immense stress and eats into personal lives. It’s no surprise that many engineers have horror stories about the difficulty of carrying a pager.
But does on-call have to be so dreadful? No way. Here are five best practices to help your team respond faster and build more resilient systems.
Not all incidents are created equal. On-call escalations should only start for incidents that are worth getting out of bed for. The metrics you can monitor and alert on might be too low-level to capture the actual severity of an incident. Instead, consider the impact different types of incidents have on your customers, and create severity tiers based on that impact.
To determine impact, use techniques such as user journeys (where metrics are consolidated based on typical usage patterns) and black box monitoring (where metrics are gathered only using what external customers can see). These will help you break down an incident into specific metrics you’ll monitor to trigger alerts. This also helps you cut out metrics that only make things noisier.
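As a rough illustration of combining the two techniques, the sketch below rolls black box probe results up into one success rate per user journey. The `ProbeResult` shape, the `checkout` journey, and the sample data are all assumptions made for the example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one synthetic run through a user journey (hypothetical shape)."""
    journey: str          # e.g. "checkout"
    steps_ok: list[bool]  # one entry per step, judged only by what a customer could see


def journey_success_rate(results: list[ProbeResult], journey: str) -> float:
    """Fraction of probe runs in which every step of the journey succeeded."""
    runs = [r for r in results if r.journey == journey]
    if not runs:
        raise ValueError(f"no probe results for journey {journey!r}")
    succeeded = sum(1 for r in runs if all(r.steps_ok))
    return succeeded / len(runs)


# Illustrative data: four probe runs, one of which failed at the second step.
results = [
    ProbeResult("checkout", [True, True, True]),
    ProbeResult("checkout", [True, False, True]),
    ProbeResult("checkout", [True, True, True]),
    ProbeResult("checkout", [True, True, True]),
]
print(journey_success_rate(results, "checkout"))  # 0.75
```

A single journey-level number like this is usually a better alert trigger than the individual step metrics underneath it, because it moves only when customers are actually affected.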
Once you have your metrics, make sure your team agrees on how to classify incidents and what response each class requires. Schedule time to review these choices based on retrospectives of previous incidents. Was that Sev 0 actually a Sev 0? Does a Sev 3 need all those people alerted? Your classification system should be logical and consistent.
Knowing the difference between a Sev 0 and a Sev 3 incident can save you from opening your laptop at 2 AM. It can also save you from underestimating a critical, customer-facing incident.
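One way to keep classification consistent is to write the rules down as code the whole team can review. The following is a minimal sketch: the four tiers, the thresholds, and the rule that only Sev 0 and Sev 1 page someone are hypothetical and should come from your own retrospectives.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # critical, customer-facing
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3  # minor, can wait for working hours


def classify(affected_fraction: float, core_journey_down: bool) -> Severity:
    """Map customer impact to a severity tier (thresholds are illustrative)."""
    if not 0.0 <= affected_fraction <= 1.0:
        raise ValueError("affected_fraction must be between 0 and 1")
    if core_journey_down or affected_fraction >= 0.50:
        return Severity.SEV0
    if affected_fraction >= 0.10:
        return Severity.SEV1
    if affected_fraction >= 0.01:
        return Severity.SEV2
    return Severity.SEV3


def should_page(severity: Severity) -> bool:
    """Assumed policy: only the two highest tiers wake someone up."""
    return severity <= Severity.SEV1


sev = classify(affected_fraction=0.02, core_journey_down=False)
print(sev.name, should_page(sev))  # SEV2 False
```

Because the thresholds live in one place, a retrospective that concludes "that Sev 0 was really a Sev 1" becomes a small, reviewable change rather than a verbal agreement.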
Imagine an incident that is crucial enough to rouse a team member in the wee hours of the morning. What can your team do to help them resolve the incident and get back to bed as fast as possible? The answer is a runbook.
A runbook is a set of detailed instructions for resolving each type of incident. This guidance helps ease the cognitive burden of on-call troubleshooting. A good runbook also contains specific commands to execute or places in code to check.
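Runbooks are often kept as wiki pages, but storing them as structured data makes it easy to attach the right one to an alert. This is a minimal sketch under that assumption. The incident type, the steps, and the placeholder commands in angle brackets are hypothetical and stand in for whatever your own systems require.

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    incident_type: str
    summary: str
    steps: list[str] = field(default_factory=list)


# Hypothetical registry keyed by incident type.
RUNBOOKS = {
    "checkout-errors": Runbook(
        incident_type="checkout-errors",
        summary="Customers see errors when submitting an order.",
        steps=[
            "Check the checkout journey dashboard for the failing step.",
            "Run <command that lists recent deploys> and note the latest one.",
            "If the errors started after that deploy, run <rollback command>.",
            "Confirm the journey success rate recovers, then update the incident.",
        ],
    ),
}


def render(incident_type: str) -> str:
    """Return the runbook as numbered plain text, ready to attach to a page."""
    runbook = RUNBOOKS.get(incident_type)
    if runbook is None:
        return f"No runbook for {incident_type!r} yet."
    lines = [runbook.summary]
    lines += [f"{i}. {step}" for i, step in enumerate(runbook.steps, start=1)]
    return "\n".join(lines)


print(render("checkout-errors"))
```

A missing entry returns an explicit message instead of failing, so gaps in runbook coverage show up during an incident review rather than as an error at 2 AM.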