In a world shaped by technology, automation helps businesses move fast and pivot on a dime. Meanwhile, the foundational value of network operations is steadfast reliability, a condition that is traditionally best during periods of inactivity. “Network reliability engineering” (NRE) is an emerging approach to network automation that stabilizes and improves reliability while achieving the benefits of speed. (Juniper Networks, 2018)
Although NRE only dawned in mid-2018, now is the apt time to set aside titles like network administrator, network operator, or network architect and embrace a new one: Network Reliability Engineer. Just as sysadmins have evolved from technicians into technologists under the title of SRE, Site Reliability Engineer, NRE is the title for the modern network engineer. An NRE is a professional who can implement a network within a reliable ecosystem. As for defining network reliability engineering itself: it is the practice of engineering an automated network, delivered as a service model, that operates reliably without compromising scalability, rate of change, or performance.
Just as SREs define their methods through DevOps, DevNetOps is the method embraced by network reliability engineering. While developers and application operators work tightly together on top of cloud-native infrastructure such as Kubernetes, the cluster SRE is the crucial operations role that delivers operational simplicity by designing a separation of concerns. Likewise, the NRE can develop simplicity by providing consumers, likely the IaaS and cluster SREs, with an API contract to the network. At the same time, it is crucial that the foundational layer of networking itself achieves simplicity.
#insights #neural networks
Network Operation Centers, or NOCs, serve as hubs for monitoring and incident response. A NOC is usually a physical location in an organization, where operators sit at a central desk with screens showing current service data. But the functionality of a NOC can also be distributed. Some organizations build virtual NOCs, which can be staffed fully remotely, allowing for distributed teams and follow-the-sun rotations. NOC as a service is another structure gaining in popularity: here the NOC is outsourced to a third party that offers it as a service, similar to other infrastructure tools.
As IT services become more fragmented, shifting to virtual NOCs becomes more popular. These structures are far removed from the traditional big desk model, but their functions are the same. Any system where operators are able to monitor for incidents and respond to them can serve as a NOC.
The goals of NOC operators and SREs are aligned. Both try to improve the reliability of the system. In fact, SRE best practices applied to the NOC structure can take reliability to a new level. In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.
The traditional image of a NOC is a huge grid of monitors showing every detail of the service’s data. A team of operators watches like hawks, catching any warning signs of incidents and responding. This system has several advantages. The completeness of the data displayed ensures nothing is missed. Also, having eyes on glass at all times promotes timely responses.
The SRE perspective on monitoring is different. The system monitors and alerts on metrics that have customer impact. These metrics are Service Level Indicators, or SLIs. Instead of human observers, monitoring tools send alerts when these metrics hit thresholds. After iteration, these systems can be more reliable than a human observer. Yet this doesn’t mean incidents won’t slip through the cracks. SRE teaches us that failure in any system is inevitable. For organizations with multiple operating models, a mix of legacy and modern technologies, and governance and control requirements, human observers in a NOC can remain essential as another layer of monitoring.
To achieve the best of both worlds of your NOC and SRE practices, you’ll need to understand what response each of your metrics requires. For simple metrics that you can pull directly from system data, automated responses can save toil for your NOC operators. More nuanced metrics where an expert’s judgment may be necessary can be discussed in the NOC. This allows operators to focus on where their expertise is necessary. Monitoring tools handle the rest.
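This split between automated responses and operator judgment can be sketched in code. The sketch below is illustrative only: the metric names, thresholds, and routing labels are assumptions, not drawn from any real monitoring system.

```python
# Sketch: route simple threshold breaches to automation, nuanced metrics
# to the NOC. All metric names and thresholds here are hypothetical.

AUTO_THRESHOLDS = {
    "disk_usage_pct": 90.0,  # simple metric: automated cleanup can respond
    "queue_depth": 1000.0,   # simple metric: automated scaling can respond
}

# Metrics where an expert's judgment is needed before acting.
NEEDS_HUMAN = {"error_rate_anomaly", "latency_pattern_shift"}

def route_alert(metric: str, value: float) -> str:
    """Decide whether an alert is handled automatically or sent to the NOC."""
    if metric in NEEDS_HUMAN:
        return "escalate-to-noc"
    threshold = AUTO_THRESHOLDS.get(metric)
    if threshold is not None and value >= threshold:
        return "automated-response"
    return "no-action"

print(route_alert("disk_usage_pct", 95.0))        # automated-response
print(route_alert("latency_pattern_shift", 1.0))  # escalate-to-noc
```

The design point is the explicit routing table: toil-heavy, unambiguous metrics are enumerated for automation, and anything requiring nuance defaults to human review.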
When a NOC operator notices an incident, their typical mode of operation is to first triage and try to remediate the issue via runbooks and existing documentation. They determine the severity and service area of the incident. Based on this, they escalate and engage the correct people for the incident response. In a traditional NOC structure, there’s a dedicated on-call team for incident response.
In the SRE world, things become less siloed. Incident classification applies across the organization. The developers most closely involved with each service area also take on-call shifts, rather than that responsibility resting squarely on dedicated on-call teams. NOC operators can collaborate with engineers on developing fair and effective on-call schedules. Yet NOC procedures for alerting don’t need to change. All of the infrastructure set up to alert and escalate still applies. SRE only increases the range and effectiveness of these alerts by involving more experts. As service complexity grows, ensuring that a wide variety of experts can respond to incidents is essential.
#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools #noc as a service #network operations center
We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?
To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.
Availability is the simplest building block of reliability. This metric describes what percentage of the time the service is functioning. This is also referred to as the “uptime” of a service. Availability can be monitored by continuously querying the service and confirming responses return with expected speed and accuracy.
A service’s availability is a major component in how a user perceives its reliability. With this in mind, it can be tempting to set a goal for 100% uptime. But SRE teaches us that failure is inevitable; downtime-causing incidents will always occur outside of engineering expectations. Availability is often expressed in “nines,” representing how many decimal places the percentage of uptime reaches. Some major software companies boast of “five nines,” or 99.999% uptime, but never 100%.
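The “nines” notation translates directly into a downtime budget. A quick sketch of the arithmetic:

```python
# Convert an availability target expressed in "nines" into the downtime
# it permits per year. 99.999% uptime leaves only about 5.26 minutes.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {allowed_downtime_minutes(pct):.2f} min/year")
```

Each additional nine cuts the allowed downtime by a factor of ten, which is why each one is dramatically more expensive to engineer than the last.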
Moreover, users will tolerate or even fail to notice downtime in some areas of your service. Development resources devoted to improving availability beyond expectations won’t increase customer happiness. Your service’s maintainability might need these resources instead.
Another major building block of reliability is maintainability. Maintainability factors into availability by describing how downtime originates and is resolved. When an incident causing downtime occurs, maintainable services can be repaired quickly. The sooner the incident is resolved, the sooner the service becomes available again.
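The link between maintainability and availability can be made concrete with the standard reliability-engineering relation availability = MTBF / (MTBF + MTTR), where MTBF is mean time between failures and MTTR is mean time to repair. The example numbers below are illustrative only.

```python
# Steady-state availability from mean time between failures (MTBF) and
# mean time to repair (MTTR). Shrinking MTTR (faster repair) raises
# availability even when failures occur at the same rate.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical service failing about once a month (MTBF = 720 hours):
print(f"{availability(720, 4):.4%}")    # 4-hour repairs
print(f"{availability(720, 0.5):.4%}")  # 30-minute repairs
```

Holding the failure rate fixed and only improving repair time moves the service from roughly two nines toward three, which is exactly the claim in the paragraph above: the sooner incidents are resolved, the sooner the service is available again.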
There are two major components of maintainability: proactive and reactive.
#devops #availability #site reliability engineering #site reliability #site reliability engineer #maintainability #site reliability engineering tools
You know the companies that break away from the pack. You buy their products with Prime shipping, you ride in their cars. You’ve seen them disrupt entire industries. It might seem like giants such as Amazon and Uber have always existed as towering pillars of profit, but that’s not so. What sets companies like these apart is a crucial piece of knowledge. They spotted the tipping point when reliability becomes a top priority to a software company’s success.
Pinpointing this tipping point is hard. After all, many companies can’t afford to stop shipping new features to shore up their software. Timing the transition to reliability well can launch a company ahead of the competition, and win the market (e.g. Amazon, Home Depot). But missing it can spell a company or even an industry’s doom (e.g. Barnes & Noble, Forever 21, and Gymboree in the retail apocalypse). Luckily, there are signs as you approach the tipping point. From examining over 300 companies, we’ve identified five.
Let’s break these signs down together.
When a product is a novelty, new adopters have a generous tolerance for errors because they are buying a vision of the future. Once the product becomes a utility, though, companies start to depend on it for critical functions. In the case of Twilio, suicide prevention hotlines are dependent on their reliability. And people start to depend on the product for daily life.
Consider Amazon. This company set the new consumer expectation for e-commerce and disrupted the sales of companies like Barnes & Noble and many more. How? First of all, the store is always open and accessible from your living room couch. Second, Prime delivers all packages with 2-day shipping. Do Amazon’s users care more about drone delivery (a feature), or 2-day shipping (reliability)?
While drone delivery is a neat novelty, the commodity of fast shipping is Amazon’s bread and butter. Users want their order, on-time, no matter what it takes to get there. How else can parents count on Santa coming on Christmas Eve?
While it’s tough to pinpoint the exact moment an industry converts from novelty to utility, we can look at three early indicators according to Harvard Business School.
Thanks to companies like Amazon, Google, Facebook, Netflix, etc., software delivery is transitioning from a novelty to a utility, from something we like to something we need every day. People expect every service to be as responsive and available as these tech giants. As your service loses its novelty, your users will look for reliability over features.
#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools
In “Upstream” by Dan Heath, we explore a variety of different problems ranging from homelessness, to high school graduation rates, to the state of sidewalks in different neighborhoods within the same city. In each of these examples, Dan discusses how upstream thinking decreased downstream work. Upstream thinking is characterized as proactive, collective action to improve outcomes, rather than reaction after an issue has already occurred.
You can also apply this method to software development.
With technology moving at a breakneck pace, it’s difficult to keep up with unplanned work such as incidents and unknown unknowns that come with increasing software complexity and interdependencies. Yet we can’t halt development. As Dan points out, “Curiosity and innovation and competitiveness push them forward, forward, forward. When it comes to innovation, there’s an accelerator but no brake” (“Upstream”, pg 224).
We can’t impede innovation, but we can apply Dan Heath’s wisdom from upstream thinking to move away from reactive modes of work and make our teams and our systems more reliable.
Before we can focus on implementing upstream thinking, we should acknowledge common barriers. Dan notes the problem here: “Organizations are constantly dealing with urgent short-term problems. Planning for speculative future ones is, by definition, not urgent. As a result, it’s hard to convince people to collaborate when hardship hasn’t forced them to” (220).
This might make it feel like everything is a barrier to upstream thinking. But Dan separates these issues into three groups: problem blindness, lack of ownership, and tunneling.
Problem blindness is self-explanatory: you are unaware that you have a problem. Issues and daily grievances are brushed off as just the way things are.
Consider alert fatigue. When you’re paged so often that you begin ignoring the alerts, you’re exhibiting problem blindness. Not only are you ignoring potentially important notifications, but you’re desensitized and possibly becoming burned out.
In this situation, you might hear people say things like, “Oh, that’s just the way it is. Our alerts are noisy. You can ignore them,” or “I can’t remember the last time I got a weekend off. You’ll get used to it.” Tony Lykke faced this issue and gave a talk at SREcon Americas in 2019. His talk, “Fixing On-Call when Nobody Thinks it’s (Too) Broken,” describes this apathy.
It’s important to grow wise to problems. If you aren’t aware of them, you can’t begin to fix them. Question the status quo. Are there problems within your organization that have been dismissed or swept under the rug? These are sources of problem blindness. As Dan says, “The escape from problem blindness begins with the shock of awareness that you’ve come to treat the abnormal as normal” (37).
#devops #reliability #site reliability engineering #site reliability #site reliability engineer
If you’ve spent any time in tech circles lately, there are three letters you’ve surely heard: SRE. Site Reliability Engineering is the defining movement in tech today. Giants like Google and Amazon market their ability to provide reliable service and startups are now investing in reliability as an early priority.
But what makes reliability engineering so important? In this blog, we’ll look at three big benefits of investing in reliability and explain how you can get started on your journey to reliability excellence.
A reliable service is more valuable to a customer than one with inconsistent performance. It seems so obvious that you may think it goes without saying, but this reminder is crucial. Picture a typical user of your service. They are happy and engaged as they use your unique features, but don’t ignore the underlying assumption: your service works. Regardless of how your features stack up to competitors, users will always choose a functional option over a feature-rich one. No feature is more important than reliability.
The consequences of unreliable software are also more costly than proactive investment in reliability. Consider how dependent you are on technology. On a given day, you rely on an alarm to wake you up, an app to report the weather and a calendar that reminds you of your schedule. You might hail a ride from Uber or use Google Maps to avoid traffic on the freeway. Maybe you get lunch delivered from Grubhub. When you arrive home, your Amazon package is right where you expect it. We trust in these services. When they go down, we feel angry.
These are the standards your service is judged by in the era of reliability. When the most popular software boasts uptime percentages of five nines, users begin to expect a level of consistency where downtime is a non-concern. The value generated by investing in reliability isn’t just in the additional uptime of your service, but in keeping your customers happy with your brand, increasing users, and lowering the potential for churn.
You may think of reliability engineering as an overhead cost to development, an additional layer of work that must be accounted for. Time and energy must indeed be dedicated to reliability, but you’ll find that adopting SRE best practices can empower and accelerate development.
SLOs and error budgeting work as a system to ensure downtime, latency, and other indicators of unreliability are kept within acceptable bounds. When these acceptable metrics are exceeded, SLO policies can refocus development efforts to stabilize and repair. On the other hand, when SLOs are within acceptable ranges and error budget is available, development can safely accelerate. Proposed changes that may affect reliability can be measured against the SLO, allowing you to build new features with confidence.
SLOs can also empower effective development by highlighting areas of greatest business impact. When determining your SLIs (the indicators your SLOs measure) you’ll discover insights on what areas of your service matter most to users. When you understand exactly what your users expect, you understand how your service is positioned and how to develop towards customer happiness.
Despite proactive measures, incidents are inevitable. However, with SRE principles, what would otherwise be considered a setback can become another investment in development. An incident retrospective is a document collaboratively constructed in response to an incident and reviewed by those involved afterward. This may seem at first like additional work in a situation where time is already limited, but the time it saves more than makes up for it. By analyzing patterns in incidents, developers learn where to spend proactive efforts in reliability. It also encourages developers to look at ways to avoid common classes of bugs and incentivizes writing more performant code.
#devops #resilience #site reliability engineering #site reliability #site reliability engineer