Chaos Engineering: What It Means, Why It Matters

Chaos engineering is drawing a lot of interest these days, especially as organizations increasingly rely on widely distributed data infrastructures that can extend across multicloud and on-premises environments, where the risk of failure grows considerably. But while many agree that chaos engineering involves some form of planning, a widely accepted definition remains elusive.

For Kolton Andrus, CEO and co-founder of Gremlin, chaos engineering “is one of my favorite topics for debate,” and “is what makes chaos engineering sound fun and exciting.”

In this edition of The New Stack Makers podcast, Andrus defines chaos engineering and describes how organizations can make it work for them. Alex Williams, founder and publisher of The New Stack, hosted this episode.

The very idea of chaos — and an IT organization’s embrace of it — can conjure up fear in many. “[Chaos engineering] scares the pants off of some old school folks that aren’t comfortable with that kind of chaos in their environments. And so most people think chaos engineering is randomly breaking things and seeing what happens,” said Andrus. “I think that chaos engineering is thoughtful, planned experiments that teach us about our system and one of the key concepts that goes with that is this idea of the ‘blast radius.’ When we run this experiment, whom might we impact? Because the goal is to prevent outages, not to cause an outage and we never want to inadvertently cause customer pain. We never want to cause an outage because we were being cavalier in our approach.”

Andrus brings a deep background in the subject to the debate. One of the pioneers of chaos engineering, he was heavily involved in preventing service outages before founding Gremlin, first at Amazon and then at Netflix. “When an outage happens, it’s time-intensive and expensive. It’s damaging to your brand,” he explained. “And if you work at a place like Amazon or Netflix, an outage costs hundreds of thousands to millions of dollars and so preventing every outage and preventing every minute of downtime is worth the investment.”

thenewstack.io
