Einar Hintz

The Principles of Chaos Engineering

Resilience is something those who use Kubernetes to run apps and microservices in containers aim for. When a system is resilient, it can handle losing a portion of its microservices and components without the entire system becoming inaccessible.

Resilience is achieved by integrating loosely coupled microservices. When a system is resilient, microservices can be updated or taken down without having to bring the entire system down. Scaling becomes easier too, since you don’t have to scale the whole cloud environment at once.

That said, resilience is not without its challenges. Building microservices that are independent yet work well together is not easy. You also have to create and maintain a reliable system with high fault tolerance. This is where Chaos Engineering comes into play.

What Is Chaos Engineering?

Chaos Engineering has been around for almost a decade now, but it is still a relevant and useful concept for improving your overall system architecture. In essence, Chaos Engineering is the process of deliberately triggering and injecting faults into a system. Instead of waiting for errors to occur, engineers take deliberate steps to cause (or simulate) errors in a controlled environment.

Chaos Engineering allows for better, more advanced resilience testing. Developers can now experiment in cloud-native distributed systems. Experiments involve testing both the physical infrastructure and the cloud ecosystem.

Chaos Engineering is not a new approach. In fact, companies like Netflix have been practicing resilience testing for years through Chaos Monkey, an in-house Chaos Engineering framework designed to strengthen their cloud infrastructure.

When dealing with a large-scale distributed system, Chaos Engineering provides an empirical way of building confidence by anticipating faults instead of reacting to them. The chaotic condition is triggered intentionally for this purpose.
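Since the resilience discussion above is framed around Kubernetes, a minimal fault-injection experiment can be sketched with nothing more than kubectl: delete one random pod behind a workload and watch whether the ReplicaSet restores the steady state. This is only a sketch; the namespace and label selector below are hypothetical placeholders.

```bash
#!/usr/bin/env bash
# Minimal chaos experiment sketch: kill one random pod and observe recovery.
# The "shop" namespace and "app=checkout" selector are hypothetical placeholders.
set -euo pipefail

NAMESPACE="shop"
SELECTOR="app=checkout"

echo "Steady state:"
kubectl get pods -n "$NAMESPACE" -l "$SELECTOR"

# Pick one pod at random and delete it (the injected fault)
POD=$(kubectl get pods -n "$NAMESPACE" -l "$SELECTOR" -o name | shuf -n 1)
echo "Injecting fault: deleting $POD"
kubectl delete -n "$NAMESPACE" "$POD"

# Give the ReplicaSet time to reschedule, then check whether the system healed
sleep 30
echo "Post-experiment state:"
kubectl get pods -n "$NAMESPACE" -l "$SELECTOR"
```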

There are a lot of analogies depicting how Chaos Engineering works, but the traffic light analogy represents the concept best. Conventional testing is similar to testing traffic lights individually to make sure that they work.

Chaos Engineering, on the other hand, means shutting down a busy set of intersections to see how traffic reacts to the chaos of losing its traffic lights. Since the test is run deliberately, more insights can be collected from the process.

#devops #chaos engineering #high fault tolerance #microservice-based architecture #microservices #microservices architecture #resilience engineering


Chaos Engineering — How to Break AWS Infrastructure on Purpose

> 1. What is Chaos Engineering and why it is important

Chaos Engineering is a type of Engineering where we test a system's robustness, reliability and its ability to survive a disaster without manual intervention.

It is a process where we deliberately and productively disrupt our Infrastructure and test how quickly and efficiently our Applications and Infra auto-heal themselves, and how well they hold up during a disaster or any System Catastrophe.

Sounds interesting, huh?

Well, it is very interesting because we get to experiment with, play with and disrupt our Infra, keenly observe how it reacts, and learn and improve from it. This makes our Infra more robust and stable, and gives us more confidence in our production stacks (which, I think, is very important).

We get to know the weaknesses and leaks in our system, which helps us overcome those issues beforehand in our Test Environment.

There are many Chaos experiments we can perform on our system, such as deleting a random EC2 Instance or deleting Services, which we shall explore in the last section.

> 2. Addressing Prerequisites — Set up your AWS Account and CLI on your Terminal
Let’s get our hands dirty by setting up our Infra ready to disrupt.

Prerequisites:

  1. Get the Access Key ID and Secret Access Key from AWS Account
  2. Install AWS CLI on your local machine
  3. Configure AWS credentials for the AWS Account on your machine
  4. Set up Infra — Create an Auto Scaling Group and attach 3 EC2 Instances to it as the Desired and Min Capacity (assume Tasks/Services are running inside it).
  5. Validate AWS CLI by checking the number of Instances against the newly created ASG

Get the Access Key ID and Secret Access Key from AWS Account

Go to https://aws.amazon.com/console/ and log in to the AWS Console. Navigate to the IAM section → Dashboard → Manage Security Credentials → Access Keys tab and extract your Access Key ID and Secret Access Key.

Go ahead and create one if you don't have one.

AWS Access Keys (Masked for Security)

Install AWS CLI on your local machine

After jotting down the keys, let's install AWS CLI v2 on your system. If you already have the CLI installed and configured, please proceed to the step where we create the AWS Infra.

Install AWS CLI by following the commands mentioned in the AWS documentation.

Installing the AWS CLI version 2 on macOS (docs.aws.amazon.com): this topic describes how to install, update, and remove the AWS CLI version 2 on macOS.

After installing the AWS CLI, go to your Mac Terminal and type in aws; that should print the CLI's usage output. This confirms and validates that the AWS CLI has been successfully installed.
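As a rough sketch of the documented steps for macOS (verify the package URL against the current AWS docs before running; it is shown here as AWS publishes it for the CLI v2 installer):

```bash
# Download and install AWS CLI v2 on macOS (per the AWS documentation)
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /

# Validate the installation
aws --version   # prints something like: aws-cli/2.x.x Python/3.x.x Darwin/...
aws             # with no arguments, prints the CLI usage/help output
```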

AWS CLI Validation

Configure AWS credentials for the AWS Account on your machine

Now it is time to map your AWS Credentials to your local machine. We need to configure the Access Key ID and Secret Access Key on your machine so that you can connect to your AWS Account from it, and create and disrupt the Infra using the AWS CLI.

aws configure should do the trick and will ask for the Credentials, region and output format. You might want to configure it as in the example below.
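A minimal sketch of that interaction, with obviously fake placeholder keys and an assumed region and output format:

```bash
$ aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX          # placeholder, paste your own key
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: us-east-1                   # any region you prefer
Default output format [None]: json
```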


We can validate this by looking at the ~/.aws/credentials file.

This file reflects the credentials we have just added in the terminal and displays the keys. With this step finished, we now have access to the AWS Account from our machine through the AWS CLI. Eureka…!!!
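A quick check, assuming the keys were stored under the default profile (values shown are placeholders):

```bash
# Inspect the stored credentials
cat ~/.aws/credentials
# [default]
# aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
# aws_secret_access_key = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
```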

Setup Infra — Create an Auto Scaling Group and attach 3 EC2 Instances to it as the Desired and Min Capacity (assume Tasks/Services are running inside it).

We will be using the AWS CLI to create a Chaos Experiment and disrupt the Instances. For the time being, we shall create an Auto Scaling Group and attach 3 EC2 Instances using the AWS Console (a rough CLI equivalent is sketched after the steps below).

Go straight to the AWS Console, search for EC2, open the "Auto Scaling Groups" tab and create a new Auto Scaling Group.

a. Select the appropriate Instance type (preferably a free-tier t2.micro).

b. Create a new Launch Configuration and associate an IAM role if you have one.

c. Create the ASG with a minimum of 3 EC2 Instances and a maximum of 6, and add it to the required VPC and Subnets. Defaults are sufficient for this sample Experiment.
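For reference, here is a roughly equivalent CLI sketch; the launch configuration name, ASG name, AMI ID and subnet IDs below are hypothetical placeholders, so substitute values from your own account and region:

```bash
# Launch configuration for the group (name and AMI ID are placeholders)
aws autoscaling create-launch-configuration \
    --launch-configuration-name chaos-demo-lc \
    --image-id ami-0abcdef1234567890 \
    --instance-type t2.micro

# Auto Scaling Group with a minimum of 3 and a maximum of 6 instances
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name chaos-demo-asg \
    --launch-configuration-name chaos-demo-lc \
    --min-size 3 \
    --max-size 6 \
    --desired-capacity 3 \
    --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222"
```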


Validate AWS CLI by checking the number of Instances against the newly created ASG.

The new ASG gets created, and 3 new EC2 Instances get automatically launched and come to a steady state. We have established the Infra. For this Experiment, we can assume that this is how our backend Infrastructure is set up, and now we shall start disrupting. We will discuss more disruption techniques in the last section.
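A quick way to confirm that steady state from the terminal (the group name matches the placeholder used in the sketch above):

```bash
# List the instances attached to the new ASG and their lifecycle state
aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names chaos-demo-asg \
    --query "AutoScalingGroups[0].Instances[].[InstanceId,LifecycleState]" \
    --output table
```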


#chaos-testing #chaos-monkey #disruption #aws #chaos-engineering

Sofia Gardiner

The DiRT on Chaos Engineering at Google • Jason Cahoon • GOTO 2021

COURTNEY NASH: Prerequisites for Chaos Engineering

Chaos Engineering is often characterized as "breaking things in production," which lends it an air of something only feasible for elite or sophisticated organizations. In practice, it has been a key element in digital transformation from the ground up for a number of companies, ranging from pre-streaming Netflix to those in highly regulated industries like healthcare and financial services. In this talk, you'll learn the basic prerequisites for Chaos Engineering, including a couple of pragmatic ways to get started.

JASON CAHOON: The DiRT on Chaos Engineering @ Google

A shallow dive into 15 years of Chaos Engineering at Google, the lessons we’ve learned performing many thousands of disaster tests on production systems, and some tips on how to approach getting started with Chaos Engineering at your own organization.

TIMECODES

  • 00:00 Intro
  • 01:02 DiRT: Disaster Resiliency Testing
  • 02:53 Why?
  • 04:38 What we test?
  • 06:01 Testing themes
  • 10:01 Practical vs theoretical
  • 12:31 How?
  • 15:12 Picking what to test
  • 16:29 Steps for bootstrapping a disaster testing program
  • 18:25 Testing production vs testing in production
  • 20:16 Really, you’re breaking production though?!
  • 23:00 Reporting on results
  • 24:24 What have we learned?
  • 26:55 Test example: Run at service level
  • 28:51 Test example: Toggle the O-N / O-F-F discriminator
  • 30:25 Test example: Run without dependencies
  • 31:53 Test example: Hacked!

#chaos #chaos-engineering #developer

Chaos engineering — a remedy to unexpected madness for your website or app

During unpredictable times, when servers are overloaded, often because of high traffic and people flooding websites all over the globe, a key question arises: how do you guarantee the uptime and resilience of websites and applications? We know how viral events can make servers burst into colours of fire-red. Now even simple online shopping can do the same thing. There is a solution to that question, and _chaos_ is the key: chaos engineering, to be exact.

Chaos engineering is the concept of a "cloud armageddon", which is successfully used by Netflix engineers in their daily work. It helps to provide the uptime and resilience needed to handle traffic during the heaviest rush hours. It is basically a "testing" approach for extreme situations. It allows for "experimenting" in your environment: preparing specific conditions, running the test, and seeing whether your system is stable and fault-tolerant. It helps in finding probable failures of various types.

But besides the chaos approach, we cannot forget the role of cloud computing, which provides the ability to scale up when needed. That is helpful when we handle sudden traffic. Cloud-native companies are well positioned to maintain uptime under increased load, but when we need resilience we cannot rely on cloud capabilities alone.

Photo by Daniel Páscoa on Unsplash

Resilience, to my mind, is the ability to recover quickly when a failure or any other unpredictable event occurs. The first thing needed to keep it at a high level is to think about failure. As we are on the topic of chaos, we need to recognize the limits of our solution (website or application), know those boundaries, and develop a plan with a fallback strategy. Chaos engineering can provide the mechanism necessary to test, and then implement, a solution that can replace the existing one.

Secondly, remember the people. Whatever happens, we need to be prepared to keep doing our job and deliver the best possible service, while identifying what is important for our clients and focusing on that. People also means the team, and a crew that is quick and can adapt to upcoming changes. In my company, we run a 24/7/365 service in which we are able to sustain the highest possible level of service and react to all alerts, while delivering new solutions and testing others for the same client.

And the last one, which was mentioned just a few words ago: **don't forget to test**. Remember that testing cannot always be the answer, but it can give you some answers. The result of a standard test is a binary value that determines whether the tested application works correctly or not. _Chaos testing_, on the other hand, allows you to take new actions that affect the development and improvement of the existing version of the system. It lets you check the behaviour of other systems during such a controlled failure, i.e. the impact of the absence or partial non-operation of a service on the entire system.

Photo by Alex Kotliarskyi on Unsplash

What we need to remember when providing for the resilience of a system and guaranteeing its uptime is that we should be able to adapt quickly to new circumstances. Remember that chaos engineering is a powerful practice that explores the sphere of systemic uncertainty. But at the end of the day, we are people whose work is supported by the technology.


#resilience-engineering #chaos-engineering #pandemic #cloud-computing #testing