5 On-Call Practices To Help You Sleep Through the Night

On-call: you may see it as a necessary evil. When fast incident response can make or break your reputation, designating people across the team to react at all hours of the day is a necessity. But this often creates immense stress and eats into personal lives. It’s no surprise that many engineers have horror stories about carrying a pager.

But does on-call have to be so dreadful? No way. Here are five best practices to help your team respond more quickly and build more resilient systems.

Use Meaningful Severity Levels

Not all incidents are created equal. An on-call escalation should start only when the incident is worth getting out of bed for. The metrics you can monitor and alert on might be too low-level to capture the actual severity of an incident. Instead, consider the impact different types of incidents have on your customers, and create severity tiers based on that impact.
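As a rough illustration of impact-based tiers, here is a minimal Python sketch. The four tiers (Sev 0 through Sev 3), the `Impact` fields, and the 25% and 1% thresholds are all assumptions made for the example; your own tiers should come from what actually matters to your customers.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV0 = 0  # critical, customer-facing outage
    SEV1 = 1  # major degradation
    SEV2 = 2  # limited customer impact
    SEV3 = 3  # minor, can wait for business hours


@dataclass
class Impact:
    """Customer-facing impact of an incident (hypothetical fields)."""
    users_affected_pct: float  # share of customers who notice the problem
    core_journey_broken: bool  # e.g. customers cannot log in or check out


def classify(impact: Impact) -> Severity:
    """Map customer impact to a severity tier. Thresholds are illustrative."""
    widespread = impact.users_affected_pct >= 25
    if impact.core_journey_broken and widespread:
        return Severity.SEV0
    if impact.core_journey_broken or widespread:
        return Severity.SEV1
    if impact.users_affected_pct >= 1:
        return Severity.SEV2
    return Severity.SEV3
```

Note that the classifier never looks at CPU or memory readings directly. It only sees how customers are affected, which is the point of tiering by impact.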

To determine impact, use techniques such as user journeys (where metrics are consolidated based on typical usage patterns) and black box monitoring (where metrics are gathered only using what external customers can see). These will help you break down an incident into specific metrics you’ll monitor to trigger alerts. This also helps you cut out metrics that only make things noisier.
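To make the user-journey idea concrete, here is a small Python sketch. It assumes each step is a check that uses only what an external customer can see (for instance, "the login page loads"), represented here as a plain callable. The function names and the example step names are hypothetical.

```python
from typing import Callable, Dict, List

# A step is any black-box check, e.g. "the login page loads".
Step = Callable[[], bool]


def run_journey(steps: Dict[str, Step]) -> Dict[str, bool]:
    """Run the steps in order and stop at the first failure,
    because a customer who cannot log in never reaches checkout."""
    results: Dict[str, bool] = {}
    for name, step in steps.items():
        try:
            ok = bool(step())
        except Exception:
            ok = False  # a crashing check counts as a failed step
        results[name] = ok
        if not ok:
            break
    return results


def journey_success_rate(runs: List[Dict[str, bool]]) -> float:
    """Consolidate many journey runs into one metric to alert on."""
    if not runs:
        raise ValueError("need at least one journey run")
    successes = sum(1 for run in runs if all(run.values()))
    return successes / len(runs)


# Example with stand-in checks instead of real HTTP probes.
healthy = run_journey({"open_home": lambda: True, "login": lambda: True})
broken = run_journey({"open_home": lambda: True, "login": lambda: False,
                      "checkout": lambda: True})
rate = journey_success_rate([healthy, broken, healthy, healthy])
```

A single success rate per journey gives you one number to alert on instead of a dozen low-level signals.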

Once you have your metrics, make sure your team agrees on how to classify incidents and what response each class requires. Schedule time to review these choices based on retrospectives of previous incidents. Was that Sev 0 actually a Sev 0? Does a Sev 3 need all those people alerted? Your classification system should be logical and consistent.

Knowing the difference between a Sev 0 and a Sev 3 incident can save you from opening your laptop at 2 AM. It can also save you from underestimating a critical, customer-facing incident.
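One way to keep the agreed response for each class explicit is to write it down as data. The sketch below is only an assumption about what such a policy might look like: the levels, the business hours, and the paging rules are placeholders to be replaced with whatever your team agrees on.

```python
from datetime import time

# Hypothetical policy: severity level -> when a human gets paged.
RESPONSE_POLICY = {
    0: {"page_in_hours": True, "page_after_hours": True},
    1: {"page_in_hours": True, "page_after_hours": True},
    2: {"page_in_hours": True, "page_after_hours": False},
    3: {"page_in_hours": False, "page_after_hours": False},  # ticket only
}


def should_page(severity: int, now: time,
                start: time = time(9, 0), end: time = time(18, 0)) -> bool:
    """Return True if this severity warrants paging someone right now."""
    policy = RESPONSE_POLICY[severity]
    in_hours = start <= now < end
    return policy["page_in_hours"] if in_hours else policy["page_after_hours"]
```

With this table, a Sev 3 at 2 AM becomes a ticket for the morning, while a Sev 0 still pages immediately.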

Create Detailed Runbooks

Imagine an incident that is crucial enough to rouse a team member in the wee hours of the morning. What can your team do to help them resolve the incident and get back to bed as fast as possible? The answer is a runbook.

A runbook is a set of detailed instructions for resolving a specific type of incident. This guidance eases the cognitive burden of on-call troubleshooting. It can also contain specific commands to execute or places in the code to check.

  • Escalating incidents - whom to notify and when
  • Assigning roles - who will handle what if things escalate
  • Creating retrospectives - documenting the decisions made and the communications sent
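
The pieces of a runbook described above can also be captured as structured data, so that escalation rules are unambiguous at 2 AM. This is a minimal sketch: the `Runbook` and `EscalationStep` types, the timings, and the example entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EscalationStep:
    after_minutes: int  # minutes since the incident was opened
    notify: str         # who gets notified at that point


@dataclass
class Runbook:
    incident_type: str
    steps: List[str]                  # what to run or where to look
    escalation: List[EscalationStep]  # whom to notify and when
    roles: Dict[str, str]             # who handles what if things escalate


def who_to_notify(runbook: Runbook, minutes_elapsed: int) -> List[str]:
    """List everyone who should have been notified by now, in order."""
    ordered = sorted(runbook.escalation, key=lambda s: s.after_minutes)
    return [s.notify for s in ordered if s.after_minutes <= minutes_elapsed]


# Illustrative entry; every value here is a placeholder.
example = Runbook(
    incident_type="checkout errors",
    steps=["check the payment provider status page",
           "inspect recent deploys to the checkout service"],
    escalation=[EscalationStep(0, "primary on-call"),
                EscalationStep(15, "secondary on-call"),
                EscalationStep(30, "engineering manager")],
    roles={"incident commander": "primary on-call",
           "communications": "secondary on-call"},
)
```

Keeping the escalation steps in one place also makes them easy to revisit during retrospectives.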
