In unpredictable times, when servers are overloaded by high traffic as people flood websites all over the globe, the key question arises: how do we guarantee the uptime and resilience of websites and applications? We know how viral events can make servers glow fire-red. Now ordinary online shopping can do the same thing. There is an answer to that question, and _chaos_ is the key: chaos engineering, to be exact.

Chaos engineering is the practice of deliberately staging a “cloud armageddon”, an approach that Netflix engineers use successfully in their daily work. It helps deliver the uptime and resilience needed to handle traffic during the heaviest rush hours. In essence, it is testing under extreme conditions: you run experiments in your own environment, setting up specific failure conditions and observing whether your system stays stable and fault-tolerant. It helps uncover likely failures of many kinds.
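To make the idea of an experiment more concrete, here is a minimal sketch in Python. It is not how Netflix or any specific tool does it; `flaky_service`, `call_with_retries`, and `run_experiment` are hypothetical names invented for illustration. The sketch injects failures into a simulated dependency at a chosen rate and measures how many requests still succeed thanks to a simple retry.

```python
import random


def flaky_service(rng, failure_rate):
    """Hypothetical dependency that fails with the injected probability."""
    if rng.random() < failure_rate:
        raise ConnectionError("injected fault")
    return "ok"


def call_with_retries(rng, failure_rate, attempts=3):
    """The resilience mechanism under test: retry a few times, then give up."""
    for _ in range(attempts):
        try:
            return flaky_service(rng, failure_rate)
        except ConnectionError:
            continue
    return None


def run_experiment(failure_rate, requests=1000, seed=42):
    """Return the fraction of requests that succeeded under the injected fault."""
    rng = random.Random(seed)  # fixed seed so the experiment is repeatable
    succeeded = sum(
        call_with_retries(rng, failure_rate) is not None for _ in range(requests)
    )
    return succeeded / requests


if __name__ == "__main__":
    print(f"baseline success rate: {run_experiment(0.0):.3f}")
    print(f"success rate with 30% injected faults: {run_experiment(0.3):.3f}")
```

Comparing the two numbers is the experiment: if the success rate under injected faults stays close to the baseline, the retry is doing its job, and if it drops sharply, you have found a weakness before your users did.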

Beyond the chaos approach, we cannot forget the role of cloud computing, which lets us scale up when needed. That helps when we have to handle sudden traffic. Cloud-native companies are already set up to keep uptime high under increased load. But when we need resilience, we cannot rely on cloud capabilities alone.

Photo by Daniel Páscoa on Unsplash

To me, resilience means the ability to recover quickly when a failure or any other unpredictable event occurs. The first step in keeping it high is to think about failure. Since we are on the topic of chaos, we need to recognize the limits of our solution (website or application), know those boundaries, and prepare a fallback strategy for the moment they are reached. Chaos engineering gives us the mechanism to test, and then implement, a solution that can replace the existing one.
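A fallback strategy can be as small as a wrapper around the call that might fail. The following is only a sketch under my own assumptions: `with_fallback`, `live_recommendations`, and `cached_bestsellers` are hypothetical names, and the "outage" is simulated by always raising an error.

```python
def with_fallback(primary, fallback):
    """Wrap `primary` so that any exception triggers `fallback` instead."""
    def wrapped(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade gracefully instead of failing the whole request.
            return fallback(*args, **kwargs)
    return wrapped


def live_recommendations(user_id):
    # Hypothetical primary path; it always fails here to simulate an outage.
    raise TimeoutError("recommendation service unavailable")


def cached_bestsellers(user_id):
    # Hypothetical degraded path: a static list that needs no remote call.
    return ["bestseller-1", "bestseller-2"]


recommend = with_fallback(live_recommendations, cached_bestsellers)

if __name__ == "__main__":
    print(recommend(user_id=7))
```

The point of a chaos experiment is then to force the primary path to fail on purpose and confirm that the fallback really takes over.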

Secondly, remember the people. Whatever happens, we need to be prepared to keep doing our job and deliver the best possible service, while identifying what matters most to our clients and focusing on that. People also means the team: a crew that is quick and can adapt to upcoming changes. In my company, we run a 24/7/365 service, which lets us sustain the highest possible level of service and react to every alert while delivering new solutions and testing others for the same client.

And the last point, mentioned a moment ago: **don’t forget to test.** Remember that testing cannot always be the answer, but it can give you some answers. The result of a standard test is a binary value: the tested application either works correctly or it does not. _Chaos testing_, on the other hand, lets you take new actions that shape the development and improvement of the existing version of the system. It lets you check the behaviour of other systems during a controlled failure, i.e. the impact that a missing or partially working service has on the entire system.
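One way to reason about that impact before running a controlled failure is to list which services depend, directly or indirectly, on the one you plan to take down. The sketch below assumes a made-up system; the `DEPENDS_ON` map and the `affected_by` helper are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical system: each service maps to the services it calls directly.
DEPENDS_ON = {
    "web": ["cart", "search"],
    "cart": ["payments", "inventory"],
    "search": ["inventory"],
    "payments": [],
    "inventory": [],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the graph: for each service, record who calls it.
    callers = {}
    for service, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(service)

    # Breadth-first walk upwards from the failed service.
    affected, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


if __name__ == "__main__":
    print(sorted(affected_by("payments", DEPENDS_ON)))
```

The list is only a prediction. The chaos test itself then shows whether those services actually degrade, and whether anything outside the list breaks as well.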

Photo by Alex Kotliarskyi on Unsplash

What we need to remember when making a system resilient and guaranteeing uptime is that we should be able to adapt quickly to new circumstances. Chaos engineering is a powerful practice that explores the sphere of systemic uncertainty. But at the end of the day, we are people whose work is supported by technology.

#resilience-engineering #chaos-engineering #pandemic #cloud-computing #testing

Chaos engineering — a remedy to unexpected madness for your website or app