Site Reliability Engineering (SRE) Best Practices

What is Site Reliability Engineering (SRE)?

The site reliability engineering (SRE) concept originated at Google. The idea is closely related to the principles of DevOps. It’s an approach to IT operations. SRE teams use the software to manage systems, solve problems, and automate operations tasks.

SRE teams take the tasks that IT operations teams have done, often manually, and instead give them to engineers or ops teams who use tools and automation to solve problems and manage production systems.

It’s a valuable practice while creating scalable and highly reliable software systems. It helps organizations manage massive infrastructure through code, which is more scalable and sustainable for system admins managing hundreds of thousands of machines.

#sre #best-practices-sre #infracloud #cloud-native #devops-best-practices

What is GEEK

Buddha Community

Site Reliability Engineering (SRE) Best Practices
bindu singh

bindu singh

1647351133

Procedure To Become An Air Hostess/Cabin Crew

Minimum educational required – 10+2 passed in any stream from a recognized board.

The age limit is 18 to 25 years. It may differ from one airline to another!

 

Physical and Medical standards –

  • Females must be 157 cm in height and males must be 170 cm in height (for males). This parameter may vary from one airline toward the next.
  • The candidate's body weight should be proportional to his or her height.
  • Candidates with blemish-free skin will have an advantage.
  • Physical fitness is required of the candidate.
  • Eyesight requirements: a minimum of 6/9 vision is required. Many airlines allow applicants to fix their vision to 20/20!
  • There should be no history of mental disease in the candidate's past.
  • The candidate should not have a significant cardiovascular condition.

You can become an air hostess if you meet certain criteria, such as a minimum educational level, an age limit, language ability, and physical characteristics.

As can be seen from the preceding information, a 10+2 pass is the minimal educational need for becoming an air hostess in India. So, if you have a 10+2 certificate from a recognized board, you are qualified to apply for an interview for air hostess positions!

You can still apply for this job if you have a higher qualification (such as a Bachelor's or Master's Degree).

So That I may recommend, joining Special Personality development courses, a learning gallery that offers aviation industry courses by AEROFLY INTERNATIONAL AVIATION ACADEMY in CHANDIGARH. They provide extra sessions included in the course and conduct the entire course in 6 months covering all topics at an affordable pricing structure. They pay particular attention to each and every aspirant and prepare them according to airline criteria. So be a part of it and give your aspirations So be a part of it and give your aspirations wings.

Read More:   Safety and Emergency Procedures of Aviation || Operations of Travel and Hospitality Management || Intellectual Language and Interview Training || Premiere Coaching For Retail and Mass Communication |Introductory Cosmetology and Tress Styling  ||  Aircraft Ground Personnel Competent Course

For more information:

Visit us at:     https://aerofly.co.in

Phone         :     wa.me//+919988887551 

Address:     Aerofly International Aviation Academy, SCO 68, 4th Floor, Sector 17-D,                            Chandigarh, Pin 160017 

Email:     info@aerofly.co.in

 

#air hostess institute in Delhi, 

#air hostess institute in Chandigarh, 

#air hostess institute near me,

#best air hostess institute in India,
#air hostess institute,

#best air hostess institute in Delhi, 

#air hostess institute in India, 

#best air hostess institute in India,

#air hostess training institute fees, 

#top 10 air hostess training institute in India, 

#government air hostess training institute in India, 

#best air hostess training institute in the world,

#air hostess training institute fees, 

#cabin crew course fees, 

#cabin crew course duration and fees, 

#best cabin crew training institute in Delhi, 

#cabin crew courses after 12th,

#best cabin crew training institute in Delhi, 

#cabin crew training institute in Delhi, 

#cabin crew training institute in India,

#cabin crew training institute near me,

#best cabin crew training institute in India,

#best cabin crew training institute in Delhi, 

#best cabin crew training institute in the world, 

#government cabin crew training institute

Wilford  Pagac

Wilford Pagac

1599937200

How to Build Your SRE Team

As you implement SRE practices and culture at your organization, you’ll realize everyone has a part to play. From engineers setting SLOs to management upholding the virtue of blamelessness to marketing teams conducting retrospectives on email campaigns, there’s no part of an organization that doesn’t benefit from the SRE mentality.

However, while it’s not necessary to have people with the title of ‘SRE’ to successfully adopt the best practices of SRE, having people who are dedicated to stewardship of SRE practices is important to achieve reliability excellence. In this blog post, we’ll look at some of the many roles an SRE can play, and how to find people with those skill sets.

#devops #teams #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools

Iliana  Welch

Iliana Welch

1598403960

What Is a Kubernetes Operator and Why it Matters for SRE

Kubernetes is an open-source project that “containerizes” workloads and services and manages deployment and configurations. Released by Google in 2015, Kubernetes is now maintained by the  Cloud Native Computing Foundation. Since its release, it has become a worldwide phenomenon. The majority of cloud-native  companies use it, SaaS vendors offer commercial prebuilt versions, and there’s even an annual  convention!

What has made Kubernetes become such a fundamental service? A major factor is its automation capabilities. Kubernetes can automatically make changes to the configuration of deployed containers or even deploy new containers based on metrics it tracks or requests made by engineers. Having Kubernetes handle these processes saves time, eliminates toil, and increases consistency.

If these benefits sound familiar, it might be because they overlap with the philosophies of SRE. But how do you incorporate the automation of Kubernetes into your SRE practices? In this blog post, we’ll explain the Kubernetes Operator—the Kubernetes function at the heart of customized automation—and discuss how it can evolve your SRE solution.

What the Kubernetes Operator Can Do

In Kubernetes Operators: Automating the Container Orchestration Platform, authors Jason Dobies and Joshua Wood describe an Operator as “an automated Site Reliability Engineer for its application.” Given an SRE’s multifaceted experience and diverse workload, this is a bold statement. So what exactly can the Operator do?

#tutorial #devops #kubernetes #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools #kubernetes operators #kubernetes operator

Kole  Haag

Kole Haag

1602925200

Availability, Maintainability, Reliability: What's the Difference?

We live in an era of reliability where users depend on having consistent access to services. When choosing between competing services, no feature is more important to users than reliability. But what does reliability mean?

To answer this question, we’ll break down reliability in terms of other metrics within reliability engineering: availability and maintainability. Distinguishing these terms isn’t a matter of semantics. Understanding the differences can help you better prioritize development efforts towards customer happiness.

Availability

Availability is the simplest building block of reliability. This metric describes what percentage of the time service is functioning. This is also referred to as the “uptime” of a service. Availability can be monitored by continuously querying the service and confirming responses return with expected speed and accuracy.

A service’s availability is a major component in how a user perceives the reliability. With this in mind, it can be tempting to set a goal for 100% uptime. But SRE teaches us that failure is inevitable; downtime-causing incidents will always occur outside of engineering expectations. Availability is often expressed in “nines,” representing how many decimals places the percentage of uptime can reach. Some major software companies will boast of “five nines,” or 99.999% uptime—but never 100%

Moreover, users will tolerate or even fail to notice downtime in some areas of your service. Development resources devoted to improving availability beyond expectations won’t increase customer happiness. Your service’s maintainability might need these resources instead.

Maintainability

Another major building block of reliability is maintainability. Maintainability factors into availability by describing how downtime originates and is resolved. When an incident causing downtime occurs, maintainable services can be repaired quickly. The sooner the incident is resolved, the sooner the service becomes available again.

There are two major components of maintainability: proactive and reactive.

  • Proactive maintainability involves building a codebase that can be easily understood and changed. As development progresses, issues will arise from incompatibility with existing code. If engineers are writing “spaghetti code” instead of prioritizing maintainability, issues are likely to occur and be difficult to find and solve. Proactive maintenance also includes procedures such as quality assurance and testing.
  • Reactive maintainability describes a service’s ability to be repaired after incidents. This is influenced by a service’s incident response procedures. As incidents are inevitable, great incident response and guardrails are a necessity. If incident response procedures are reliable, teams will resolve incidents quickly. Proper incident responses also foster learning to reduce recurrence. A highly maintainable service allows engineers to implement these lessons effectively

#devops #availability #site reliability engineering #site reliability #site reliability engineer #maintainability #site reliability engineering tools

Wiley  Mayer

Wiley Mayer

1602954000

Guide For Implementing SRE In NOCs

Network Operation Centers, or NOCs, serve as hubs for monitoring and incident response. A NOC is usually a physical location in an organization. NOC operators sit at a central desk with screens showing current service data. But, the functionality of a NOC can be distributed. Some organizations build virtual NOCs. These can be staffed fully remotely. This allows for distributed teams and follow-the-sun rotations. NOC as a service is another structure gaining in popularity. This is where the NOC is outsourced to a third party that offers it as a service similar to other infrastructure tools.

As IT services become more fragmented, shifting to virtual NOCs becomes more popular. These structures are far removed from the traditional big desk model, but their functions are the same. Any system where operators are able to monitor for incidents and respond to them can serve as a NOC.

The goals of NOC operators and SREs are aligned. Both try to improve the reliability of the system. In fact, SRE best practices applied to the NOC structure can take reliability to a new level. In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Monitor Smarter by Focusing On Complex Metrics

The traditional image of a NOC is a huge grid of monitors showing every detail of the service’s data. A team of operators watches like hawks, catching any warning signs of incidents and responding. This system has several advantages. The completeness of the data displayed ensures nothing is missed. Also, having eyes on glass at all times promotes timely responses.

The SRE perspective on monitoring is different. The system monitors and alerts on metrics that have customer impact. These metrics are Service Level Indicators or SLIs. Instead of human observers, monitoring tools send alerts when these metrics hit thresholds. After iteration, these systems can be more reliable than a human observer. Yet, this doesn’t mean incidents won’t slip through the cracks. SRE teaches us that failure in any system is inevitable. Especially for organizations with multiple operating models, a mix of legacy and modern technologies, and the need to ensure governance and control, human observers in a NOC as another layer of monitoring may continue to be deeply essential.

To achieve the best of both worlds of your NOC and SRE practices, you’ll need to understand what response each of your metrics requires. For simple metrics that you can pull directly from system data, automated responses can save toil for your NOC operators. More nuanced metrics where an expert’s judgment may be necessary can be discussed in the NOC. This allows operators to focus on where their expertise is necessary. Monitoring tools handle the rest.

Escalate and Triage With Classification and On-Call

When a NOC operator notices an incident, their typical mode of operation is to first triage and try to remediate the issue via runbooks and existing documentation. They determine the severity and service area of the incident. Based on this, they escalate and engage the correct people for the incident response. In a traditional NOC structure, there’s a dedicated on-call team for incident response.

In the SRE world, things become less siloed. Incident classification applies across the organization. The developers most closely involved with each service area are also responsible for on-call shifts, rather than laying that responsibility squarely on dedicated on-call teams. NOC operators can collaborate with engineers on developing fair and effective on-call schedules. Yet NOC procedures for alerting don’t need to change. All of the infrastructures set up to alert and escalate will still apply. SRE only increases the range and effectiveness of these alerts by involving more experts. As service complexity grows, ensuring that a wide variety of experts can respond to incidents is essential.

#devops #site reliability engineering #site reliability #site reliability engineer #site reliability engineering tools #noc as a service #network operations center