With cloud-native applications, there’s always a chance that something could interrupt your services. Maybe a wire gets unplugged and brings down your server, or one of your services loses a network connection it depends on.

These are issues your system doesn’t typically account for in code or infrastructure. Still, there is a way to figure out where some of your system’s weaknesses are and give those areas extra attention. It’s difficult to build a system that accounts for every odd occurrence that might happen, but with some chaos engineering, you can make your system resilient against a lot of unexpected conditions.

What is chaos engineering?

Chaos engineering is the practice of experimenting on your system to build confidence in its ability to withstand unpredictable conditions in production. In other words, you run the experiments on the production system itself.

Deliberately experimenting on production goes against almost everything we know about best practices, and that’s what makes it a useful concept. You’re trying to see how well your system holds up against any number of random things that could happen.

With all of the different dependencies and services an application needs to operate, there are a lot of places where the system could unexpectedly fail. The goal of chaos engineering is to find those failure points and build safeguards around them before they become critical issues.
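To make the idea of finding a failure point and building a safeguard around it more concrete, here is a minimal sketch in Python. It is only an in-process simulation, not a real production experiment: `fetch_price`, `price_with_fallback`, and the 30% failure rate are hypothetical names and values made up for illustration. The sketch injects random faults into a dependency call and compares how often a request survives with and without a fallback.

```python
import random


class DependencyDown(Exception):
    """Raised when the simulated dependency is unavailable."""


def flaky(call, failure_rate, rng):
    """Wrap `call` so it raises DependencyDown with probability `failure_rate`."""
    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyDown("injected fault")
        return call(*args, **kwargs)
    return wrapped


def fetch_price(item):
    """Hypothetical stand-in for a network call to a pricing service."""
    return {"apple": 1.25}.get(item, 0.0)


def price_with_fallback(fetch, item, default=0.0):
    """Safeguard: return a default value instead of failing the request."""
    try:
        return fetch(item)
    except DependencyDown:
        return default


def survives(action):
    """Return True if `action` completes without the injected failure escaping."""
    try:
        action()
        return True
    except DependencyDown:
        return False


def run_experiment(requests=1000, failure_rate=0.3, seed=42):
    """Inject faults and count successful requests without and with the safeguard."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    chaotic_fetch = flaky(fetch_price, failure_rate, rng)
    unguarded = sum(
        survives(lambda: chaotic_fetch("apple")) for _ in range(requests)
    )
    guarded = sum(
        survives(lambda: price_with_fallback(chaotic_fetch, "apple"))
        for _ in range(requests)
    )
    return unguarded, guarded


if __name__ == "__main__":
    unguarded, guarded = run_experiment()
    print(f"without safeguard: {unguarded}/1000 requests succeeded")
    print(f"with safeguard:    {guarded}/1000 requests succeeded")
```

The unguarded count shows where the weakness is, and the guarded count shows whether the safeguard actually closes it. A real experiment would inject faults at the infrastructure or network level rather than inside the application code, but the loop of injecting a failure, observing the result, and adding a safeguard is the same.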

Chaos Engineering: What You Don't Know