As chaos engineering becomes a more mainstream way of proactively seeking out your system's weaknesses, we see it applied in increasingly complicated circumstances and by teams of all sizes.
One such area is serverless. After all, serverless computing is the language-agnostic, pay-as-you-go way to access backend services. This makes it multitenant, stateless, highly distributed, and heavily reliant on third parties. A heck of a lot can go wrong with so much out of your control.
From higher granularity to an expanding attack surface to new failure types, serverless has many potential points of failure, noted Thundra Product Vice President Emrah Samdan at ChaosConf, hosted by Gremlin. Chaos engineering is one way to find out where those potential failures are before they cripple your operations.
If there was an underlying theme of this year’s ChaosConf, it’d be defining just what chaos engineering is. Because, even among expert fire starters, explaining the concept is as much art as it is science.
For Samdan, it’s not about being a glutton for punishment, breaking your system because you feel like it. And it’s not about placing blame.
For him, chaos engineering is all about asking: “What if?”
Samdan said, “You need to ask your system: What if your databases become unreachable? What if your whole region goes down? What if my downstream Lambda times out? Any type of failure can happen in your systems. Chaos engineering answers these questions.”
He says you need to answer these questions to establish the acceptable limits of your system. He compared it to a vaccine: each experiment injects a little more resiliency and confidence into your system.
“Chaos isn’t a pit. Chaos is a ladder.” — Emrah Samdan, Thundra
Echoing another message from ChaosConf, Samdan reminds us that chaos engineering isn't just for giant streaming companies. Anyone can do it, and you can start small. He even recommends avoiding production at the start.
“You can just start when you are staging. Start small. Start injecting into a relatively new service, but put your tools in and just grow stronger with chaos experiments,” he recommended.
Start by measuring your steady state, the normal ups and downs of your system. He recommends using an observability tool to do this.
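As a minimal sketch of what "measuring your steady state" can mean in practice (the metric, numbers, and function name here are illustrative, not from Samdan's talk), you can summarize a metric's normal behavior as a band around its mean:

```python
import statistics

def steady_state_band(samples, tolerance=3.0):
    """Summarize a metric's normal behavior as a band around its mean.

    `samples` is a list of observed values (e.g. p99 latency in ms)
    pulled from your observability tool; `tolerance` is how many
    standard deviations still count as "normal".
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return (mean - tolerance * stdev, mean + tolerance * stdev)

# Example: a week of daily p99 latency readings for one function (ms).
latencies = [120, 118, 131, 125, 119, 122, 128]
low, high = steady_state_band(latencies)
print(f"steady state: {low:.1f}ms to {high:.1f}ms")
```

Anything you observe outside that band during an experiment is a deviation from steady state worth investigating.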
The typical system-level metrics include:
Samdan says typical business-level metrics include:
Set acceptable limits for each of these metrics. Then develop a hypothesis: what happens if a given failure occurs? Some examples: What if the database becomes unreachable? What if a downstream Lambda times out?
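A hypothesis of this kind boils down to "when I inject this failure, the system stays within its acceptable limits." As a rough sketch (the metric names and limit values are hypothetical, purely for illustration), that check can be as simple as:

```python
# Hypothetical acceptable limits for the metrics you chose above.
LIMITS = {
    "p99_latency_ms": 500,
    "error_rate_pct": 1.0,
}

def check_hypothesis(observed):
    """Compare metrics observed during an experiment against the limits.

    Returns the metrics that broke their limit; an empty list means the
    hypothesis ("the system stays within acceptable bounds") held.
    """
    return [name for name, limit in LIMITS.items()
            if observed.get(name, 0) > limit]

# During a "what if my downstream Lambda times out?" experiment:
violations = check_hypothesis({"p99_latency_ms": 740, "error_rate_pct": 0.4})
print(violations)  # the latency limit was exceeded
```

If the list comes back empty, you have gained confidence; if not, you have found a weakness before your users did.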
You can ask big questions, but start experimenting on only a small part of the system. Samdan reminds you to inject failure into a controlled piece of your system, such as injecting latency into one function rather than the entire architecture. You want to maintain a small blast radius.
That's also why you run only one experiment at a time. Then you can continue, injecting latency into two, three, four functions. He says you keep going until something breaks.
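One way to keep that blast radius small is to scope the injection to a single function and gate it with configuration. The sketch below (an assumption for illustration, not Thundra's or Gremlin's actual tooling) wraps one Lambda handler with artificial latency controlled by a hypothetical `CHAOS_LATENCY_MS` environment variable:

```python
import os
import random
import time

def inject_latency(handler):
    """Wrap one Lambda handler with optional artificial delay.

    The experiment is scoped to this single function and switched on
    per-deployment via the CHAOS_LATENCY_MS environment variable,
    keeping the blast radius small: unset it and the handler behaves
    normally.
    """
    def wrapped(event, context):
        delay_ms = int(os.environ.get("CHAOS_LATENCY_MS", "0"))
        if delay_ms > 0:
            # Jitter the delay a little so it resembles real network lag.
            time.sleep(random.uniform(0.5, 1.0) * delay_ms / 1000.0)
        return handler(event, context)
    return wrapped

@inject_latency
def get_order(event, context):
    return {"statusCode": 200, "body": "order details"}
```

To widen the experiment later, you apply the same wrapper to a second, third, and fourth function, one step at a time, rather than flipping latency on everywhere at once.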
#development #devops #serverless #profile