In this article, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.
Network Operations Centers, or NOCs, serve as hubs for monitoring and incident response. A NOC is usually a physical location in an organization, where operators sit at a central desk with screens showing current service data. However, the functionality of a NOC can be distributed. Some organizations build virtual NOCs that can be staffed fully remotely, which allows for distributed teams and follow-the-sun rotations. NOC as a service is another structure gaining in popularity: the NOC is outsourced to a third party that offers it as a service, much like other infrastructure tools.
As IT services become more fragmented, shifting to virtual NOCs becomes more popular. These structures are far removed from the traditional big desk model, but their functions are the same. Any system where operators are able to monitor for incidents and respond to them can serve as a NOC.
The goals of NOC operators and SREs are aligned. Both try to improve the reliability of the system. In fact, SRE best practices applied to the NOC structure can take reliability to a new level.
The traditional image of a NOC is a huge grid of monitors showing every detail of the service’s data. A team of operators watches like hawks, catching any warning signs of incidents and responding. This system has several advantages. The completeness of the data displayed ensures nothing is missed. Also, having eyes on glass at all times promotes timely responses.
The SRE perspective on monitoring is different. The system monitors and alerts on metrics that have customer impact. These metrics are Service Level Indicators, or SLIs. Instead of human observers, monitoring tools send alerts when these metrics hit thresholds. After iteration, these systems can be more reliable than a human observer. Yet this doesn't mean incidents won't slip through the cracks. SRE teaches us that failure in any system is inevitable. Human observers in a NOC can remain an essential extra layer of monitoring, especially for organizations that run multiple operating models, mix legacy and modern technologies, and need to ensure governance and control.
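As a minimal sketch of alerting on an SLI threshold, the example below computes an availability SLI from request counts and flags when it drops below a target. The `RequestWindow` shape, the function names, and the 99.9% threshold are illustrative assumptions, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one monitoring window (hypothetical shape)."""
    total: int
    failed: int


def availability_sli(window: RequestWindow) -> float:
    """Fraction of requests that succeeded; an empty window counts as healthy."""
    if window.total == 0:
        return 1.0
    return (window.total - window.failed) / window.total


def should_alert(window: RequestWindow, threshold: float = 0.999) -> bool:
    """Return True when the SLI falls below the threshold (assumed value)."""
    return availability_sli(window) < threshold
```

In practice the threshold would come from the service's own reliability targets, and it would likely be tuned over several iterations to reduce noisy alerts.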
To achieve the best of both worlds of your NOC and SRE practices, you’ll need to understand what response each of your metrics requires. For simple metrics that you can pull directly from system data, automated responses can save toil for your NOC operators. More nuanced metrics where an expert’s judgment may be necessary can be discussed in the NOC. This allows operators to focus on where their expertise is necessary. Monitoring tools handle the rest.
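One way to picture this split is a small routing table that records, per metric, whether an automated response is enough or an operator's judgment is needed. This is only a sketch: the metric names, the `MetricPolicy` type, and the `route` function are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    AUTOMATED = "automated"    # handled by tooling without operator involvement
    NOC_REVIEW = "noc_review"  # surfaced to NOC operators for expert judgment


@dataclass(frozen=True)
class MetricPolicy:
    """Describes how one metric should be handled (hypothetical structure)."""
    name: str
    needs_judgment: bool


def route(policy: MetricPolicy) -> Response:
    """Send nuanced metrics to the NOC and simple ones to automation."""
    return Response.NOC_REVIEW if policy.needs_judgment else Response.AUTOMATED


# Example policies; the metric names are made up for illustration.
POLICIES = [
    MetricPolicy("disk_usage_percent", needs_judgment=False),
    MetricPolicy("checkout_latency_regression", needs_judgment=True),
]

routing = {p.name: route(p) for p in POLICIES}
```

Keeping the decision in data rather than scattered across alert rules makes it easier to review which metrics still depend on human attention.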
When a NOC operator notices an incident, their typical mode of operation is to first triage and try to remediate the issue via runbooks and existing documentation. They determine the severity and service area of the incident. Based on this, they escalate and engage the correct people for the incident response. In a traditional NOC structure, there’s a dedicated on-call team for incident response.
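The triage-then-escalate flow described above can be sketched as two small functions: one that assigns a severity and one that picks who to engage for a given service area. The severity labels, the percentage cutoffs, and the target names are assumptions chosen for illustration; a real NOC would use its own classification scheme.

```python
def classify_severity(affected_users_pct: float) -> str:
    """Map customer impact to a severity label (cutoffs are assumed values)."""
    if affected_users_pct >= 50:
        return "sev1"
    if affected_users_pct >= 5:
        return "sev2"
    return "sev3"


def escalation_targets(service_area: str, severity: str) -> list[str]:
    """Decide who to engage based on service area and severity."""
    on_call = f"{service_area}-on-call"  # hypothetical naming convention
    if severity == "sev1":
        # Highest severity: page the on-call responder and a coordinator.
        return [on_call, "incident-commander"]
    if severity == "sev2":
        return [on_call]
    # Low severity: no page, just file a ticket for follow-up.
    return ["ticket-queue"]


severity = classify_severity(affected_users_pct=60.0)
targets = escalation_targets("payments", severity)
```

Here an incident affecting 60% of users in a hypothetical `payments` area is classified as `sev1` and escalated to both the area's on-call responder and an incident commander.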
In the SRE world, things become less siloed. Incident classification applies across the organization. The developers most closely involved with each service area also take on-call shifts, rather than that responsibility resting solely on dedicated on-call teams. NOC operators can collaborate with engineers on developing fair and effective on-call schedules. Yet NOC procedures for alerting don't need to change. All of the infrastructure set up to alert and escalate will still apply. SRE only increases the range and effectiveness of these alerts by involving more experts. As service complexity grows, ensuring that a wide variety of experts can respond to incidents is essential.