Wiley Mayer

1602954000

Guide For Implementing SRE In NOCs

Network Operations Centers, or NOCs, serve as hubs for monitoring and incident response. A NOC is usually a physical location within an organization, where operators sit at a central desk with screens showing current service data. But the functionality of a NOC can also be distributed. Some organizations build virtual NOCs that are staffed fully remotely, which allows for distributed teams and follow-the-sun rotations. NOC as a service is another structure gaining popularity: the NOC is outsourced to a third party that offers it as a service, much like other infrastructure tools.

As IT services become more fragmented, virtual NOCs are becoming more popular. These structures are far removed from the traditional big-desk model, but their functions are the same. Any system where operators can monitor for incidents and respond to them can serve as a NOC.

The goals of NOC operators and SREs are aligned. Both try to improve the reliability of the system. In fact, SRE best practices applied to the NOC structure can take reliability to a new level. In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Monitor Smarter by Focusing On Complex Metrics

The traditional image of a NOC is a huge grid of monitors showing every detail of the service’s data. A team of operators watches like hawks, catching any warning signs of incidents and responding. This system has several advantages. The completeness of the data displayed ensures nothing is missed. Also, having eyes on glass at all times promotes timely responses.

The SRE perspective on monitoring is different. The system monitors and alerts on metrics that have customer impact, known as Service Level Indicators or SLIs. Instead of relying on human observers, monitoring tools send alerts when these metrics hit thresholds. After iteration, these systems can be more reliable than a human observer. Yet this doesn’t mean incidents won’t slip through the cracks. SRE teaches us that failure in any system is inevitable. Human observers in a NOC may therefore remain essential as another layer of monitoring, especially for organizations with multiple operating models, a mix of legacy and modern technologies, and a need to ensure governance and control.
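As a rough illustration of threshold-based alerting on SLIs, here is a minimal Python sketch. The metric names, thresholds, and the `SLI` and `check_slis` helpers are all hypothetical and only stand in for whatever your monitoring tool provides.

```python
from dataclasses import dataclass


@dataclass
class SLI:
    """A customer-facing metric with an alert threshold (values are hypothetical)."""
    name: str
    threshold: float
    higher_is_worse: bool = True


def breached(sli: SLI, value: float) -> bool:
    """Return True when the measured value crosses the SLI's threshold."""
    if sli.higher_is_worse:
        return value >= sli.threshold
    return value <= sli.threshold


def check_slis(slis, measurements):
    """Return the names of SLIs whose latest measurement breaches its threshold."""
    alerts = []
    for sli in slis:
        value = measurements.get(sli.name)
        # Skip SLIs with no measurement; a real system might alert on missing data too.
        if value is not None and breached(sli, value):
            alerts.append(sli.name)
    return alerts


# Assumed example SLIs: latency should stay low, availability should stay high.
slis = [
    SLI("p99_latency_ms", threshold=500.0),
    SLI("availability_pct", threshold=99.9, higher_is_worse=False),
]
measurements = {"p99_latency_ms": 620.0, "availability_pct": 99.95}
alerts = check_slis(slis, measurements)
```

In this sketch only the latency SLI would raise an alert, since availability is still above its assumed threshold.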

To get the best of both your NOC and SRE practices, you’ll need to understand what response each of your metrics requires. For simple metrics that you can pull directly from system data, automated responses can save toil for your NOC operators. More nuanced metrics, where an expert’s judgment may be necessary, can be discussed in the NOC. This lets operators focus on the cases where their expertise is needed, while monitoring tools handle the rest.
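One way to picture this split is a small routing table that records which response each metric requires. The sketch below is only an assumption about how such a catalog might look; the metric names and the `route_alert` helper are made up for illustration.

```python
from enum import Enum


class Response(Enum):
    AUTOMATED = "automated"    # handled by tooling, no operator needed
    NOC_REVIEW = "noc_review"  # needs an operator's judgment


# Hypothetical catalog mapping each metric to the response it requires.
METRIC_RESPONSES = {
    "disk_usage_pct": Response.AUTOMATED,
    "checkout_error_rate": Response.NOC_REVIEW,
}


def route_alert(metric_name: str) -> Response:
    """Look up the response for a metric; unknown metrics default to human review."""
    return METRIC_RESPONSES.get(metric_name, Response.NOC_REVIEW)
```

Defaulting unknown metrics to NOC review is a deliberately cautious choice here: anything nobody has classified yet gets a human look.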

Escalate and Triage With Classification and On-Call

When a NOC operator notices an incident, their typical mode of operation is to first triage and try to remediate the issue via runbooks and existing documentation. They determine the severity and service area of the incident. Based on this, they escalate and engage the correct people for the incident response. In a traditional NOC structure, there’s a dedicated on-call team for incident response.
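The triage-then-escalate flow described above could be sketched roughly as follows. The severity scale, contact names, and the `escalate` helper are assumptions for illustration, not a prescribed NOC procedure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    summary: str
    severity: int      # assumed scale: 1 is the most severe
    service_area: str


# Hypothetical routing table from service area to the team that responds.
RESPONDERS = {
    "payments": "payments-oncall",
    "network": "network-oncall",
}


def escalate(incident: Incident) -> list:
    """Return the contacts to engage, based on service area and severity."""
    # Fall back to a NOC lead when no team owns the service area.
    contacts = [RESPONDERS.get(incident.service_area, "noc-lead")]
    # The most severe incidents also bring in an incident commander.
    if incident.severity == 1:
        contacts.append("incident-commander")
    return contacts
```

For example, a severity 1 incident in the assumed `payments` area would engage both `payments-oncall` and `incident-commander`.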

In the SRE world, things become less siloed. Incident classification applies across the organization. Rather than leaving on-call responsibility entirely to dedicated teams, the developers most closely involved with each service area also take on-call shifts. NOC operators can collaborate with engineers on developing fair and effective on-call schedules. Yet NOC procedures for alerting don’t need to change. All of the infrastructure set up to alert and escalate will still apply. SRE only increases the range and effectiveness of these alerts by involving more experts. As service complexity grows, ensuring that a wide variety of experts can respond to incidents is essential.
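A simple starting point for a fair on-call schedule is a round-robin rotation over the engineers who own a service area. The sketch below assumes fixed-length shifts and a hypothetical `on_call_for` helper; real schedules usually also account for time zones, holidays, and swaps.

```python
from datetime import date


def on_call_for(engineers, rotation_start, day, shift_days=7):
    """Pick the on-call engineer for a given day using a round-robin rotation."""
    if not engineers:
        raise ValueError("at least one engineer is required")
    elapsed = (day - rotation_start).days
    if elapsed < 0:
        raise ValueError("day is before the rotation started")
    # Each engineer holds one shift in turn, then the rotation wraps around.
    return engineers[(elapsed // shift_days) % len(engineers)]


# Assumed team for one service area, rotating weekly from 1 January 2024.
team = ["ana", "bo", "chen"]
start = date(2024, 1, 1)
second_week = on_call_for(team, start, date(2024, 1, 8))
```

Because every engineer gets the same number of shifts per cycle, the load is spread evenly across the team.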

#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools #noc as a service #network operations center

Marlee Carter

1620199243

5 Basic Differences Between DevOps and SRE You Should Know About

The world of Information Technology and software development often conflates DevOps with SRE, treating them as one and the same thing. However, there are vast differences between the two. While Site Reliability Engineering (SRE) has gained traction in recent years, DevOps has been around much longer (even before the term DevOps existed).

To put it simply, DevOps and SRE are both practices put in place to deliver software faster. The main difference between the two is in their approaches: DevOps focuses on shortening the software development lifecycle, while SRE concentrates on eliminating system weaknesses to achieve the same purpose.

In this article, we will look at the fundamental ways in which DevOps and SRE differ from each other. Before we do that, let’s start with understanding what DevOps and SRE are.

#software development #devops #devops and sre #devops vs sre #sre

Murray Beatty

1596344940

Help Your Data Science Career By Publishing Your Work!

This guide aims to cover everything that a data science learner may need to write and publish articles on the internet. It covers why you should write, writing advice for new writers, and a list of places that invite contributions from new writers.

Let’s get to it!

Why you should write:

Writing isn’t just for “writers”. The art of writing well is for everyone to learn - programmers, marketers, managers and leaders, alike. And yes, data scientists and analysts too!

You should write articles because when you do:

You learn:

Writing teaches you the art of writing. It’s kind of circular but it’s true.

Make no mistake, the art of writing isn’t about grammar (although that’s important) or flowery language (definitely not important). It’s about conveying your thoughts with clarity in simple language.

And learning this art is important even if you absolutely know that you don’t want to write blogs or articles for a living. It’s important because almost every job involves some form of writing: messages, emails, memos, and everything in between. So basically, writing is a medium for almost any job you can have.

Apart from that, when you write you learn the things that you thought you knew but didn’t really know. So, writing is an opportunity to learn better.

#data science career tips #guide #guides #publishing work #writing guide

Zakary Goyette

1603725960

Implementing TabNet in PyTorch

Deep Learning has taken over vision, natural language processing, speech recognition, and many other fields achieving astonishing results and even superhuman performance in some. However, the use of deep learning to model tabular data has been relatively limited.

For tabular data, the most common approach is the use of tree-based models and their ensembles. Tree-based models globally select the features that reduce entropy the most. Ensemble methods such as bagging and boosting improve on single trees further, bagging mainly by reducing variance and boosting mainly by reducing bias. Recent tree-based ensembles like XGBoost and LightGBM have dominated Kaggle competitions.

TabNet is a neural architecture developed by the research team at Google Cloud AI. It achieved state-of-the-art results on several datasets in both regression and classification problems. It combines the ability of neural nets to fit very complex functions with the **feature selection** property of tree-based algorithms. In other words, the model learns to select only the relevant features during the training process. Moreover, unlike tree-based models, which can only do feature selection globally, the feature selection process in TabNet is instance-wise. Another desirable feature of TabNet is interpretability. Unlike most deep learning models, where the neural network acts like a black box, with TabNet we can interpret which features the model selects.
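To make instance-wise feature selection a little more concrete, here is a small sketch of sparsemax, the activation the TabNet paper uses to produce its sparse feature masks. It is written in plain Python rather than PyTorch, and the per-row scores are made up: in TabNet they would be produced by a learned network, which this sketch leaves out.

```python
def sparsemax(scores):
    """Project scores onto the probability simplex; many outputs become exactly 0."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for k, z in enumerate(ordered, start=1):
        cumulative += z
        # The k largest scores stay in the support while this condition holds;
        # the last k that satisfies it determines the threshold tau.
        if 1 + k * z > cumulative:
            tau = (cumulative - 1) / k
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical feature scores for two different rows of a table.
row_a_scores = [2.0, 1.0, 0.1]
row_b_scores = [0.1, 1.0, 2.0]

# Each row gets its own mask, so different rows can rely on different features.
mask_a = sparsemax(row_a_scores)
mask_b = sparsemax(row_b_scores)
```

Unlike softmax, sparsemax can assign a weight of exactly zero to a feature, which is what lets a mask switch features off for one row while keeping them for another.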

In this blog, I will take you through a step-wise beginner-friendly implementation of TabNet in PyTorch. Let’s get started!!

#beginners-guide #tabular-data #implementation #deeplearing #data-science