As chaos engineering becomes a more mainstream way of proactively seeking out your system’s weaknesses, we see it applied to increasingly complicated circumstances and with teams of all sizes.

One such area is serverless. After all, serverless computing is the language-agnostic, pay-as-you-go way to access backend services. This makes it multitenant, stateless, highly distributed, and heavily reliant on third parties. A heck of a lot can go wrong with so much out of your control.

From higher granularity to an expanding attack surface to new failure types, serverless has many potential points of failure, noted Thundra’s Product Vice President Emrah Samdan at ChaosConf, hosted by Gremlin. Chaos engineering is one method of finding out where these potential failures are before they cripple your operations.

What Chaos Engineering Isn’t

If there were an underlying theme of this year’s ChaosConf, it’d be defining just what chaos engineering is. Because, even among expert fire starters, explaining the concept is as much art as it is science.

For Samdan, it’s not about being a glutton for punishment, breaking your system because you feel like it. And it’s not about placing blame.

For him, chaos engineering is all about asking: “What if?”

Samdan said, “You need to ask your system: What if your databases become unreachable? What if your whole region goes down? What if my downstream Lambda times out? Any type of failure can happen in your systems. Chaos engineering answers these questions.”

He says you need to answer these questions to establish the acceptable limits of your system. He likened chaos engineering to a vaccine: each experiment injects a little more resiliency and confidence into your system.

“Chaos isn’t a pit. Chaos is a ladder.” — Emrah Samdan, Thundra

How to Get Started with Chaos Engineering

Echoing another message from ChaosConf, Samdan reminds us that chaos engineering isn’t just for giant streaming companies. Anyone can do it, and you can start small. He even recommends staying out of production at the start.

“You can just start when you are staging. Start small. Start injecting into a relatively new service, but put your tools in and just grow stronger with chaos experiments,” he recommended.

Start by measuring your steady-state — the ups and downs of your system. He recommends using an observability tool to accomplish this.

The typical system-level metrics include:

  • Memory usage
  • 99th percentile (p99) latency (see the sketch after this list)
  • CPU usage
  • Time to restore service
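To capture metrics like these, any observability or monitoring tool will do. As a rough illustration only (not something Samdan prescribed), here is a minimal sketch that pulls p99 Lambda duration from AWS CloudWatch, assuming boto3 credentials are configured and using a placeholder function name:

```python
# Minimal sketch: pull p99 Duration for one Lambda function from CloudWatch.
# Assumes boto3 credentials are configured; "my-orders-function" is a placeholder name.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def p99_duration_ms(function_name: str, hours: int = 24) -> list[float]:
    """Return p99 Lambda duration (ms) per 5-minute window over the last `hours`."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Duration",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,                      # one datapoint per 5 minutes
        ExtendedStatistics=["p99"],      # percentile statistics instead of plain averages
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [p["ExtendedStatistics"]["p99"] for p in points]

if __name__ == "__main__":
    print(p99_duration_ms("my-orders-function"))
```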

Samdan says typical business-level metrics include:

  • Apdex score, which, according to New Relic, is a ratio of the number of satisfied and tolerating requests to the total requests made: each satisfied request counts as one request, while each tolerating request counts as half a satisfied request (a small calculation is sketched after this list).
  • Number of transactions, successful or otherwise.
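The Apdex arithmetic is simple enough to sketch directly. The 500 ms threshold and the conventional "tolerating means up to four times the threshold" band below are assumptions for illustration, not values from the talk:

```python
# Sketch of the Apdex calculation described above, with an assumed threshold T.
# Requests at or under T are "satisfied", under 4T are "tolerating", the rest "frustrated".

def apdex(latencies_ms: list[float], t_ms: float = 500.0) -> float:
    """Apdex = (satisfied + tolerating / 2) / total requests."""
    satisfied = sum(1 for ms in latencies_ms if ms <= t_ms)
    tolerating = sum(1 for ms in latencies_ms if t_ms < ms <= 4 * t_ms)
    return (satisfied + tolerating / 2) / len(latencies_ms) if latencies_ms else 0.0

# Example: three satisfied, one tolerating, one frustrated request.
print(apdex([120, 300, 450, 900, 2500]))  # (3 + 1/2) / 5 = 0.7
```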

Set acceptable limits for each of these metrics. Then develop a hypothesis: what happens to those metrics if a particular failure occurs? Some examples:

  • What if I inject latency of 300 milliseconds on average into every Lambda function in my architecture? SLA promise: My responses will still be within the acceptable latency range. (A simple injection sketch follows this list.)
  • What if my DynamoDB table becomes unreachable? SLA promise: My system will continue performing graceful service degradation.
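For the first hypothesis, the latency doesn’t have to come from a dedicated chaos tool. A hypothetical, hand-rolled sketch (not Thundra’s or any specific library’s API) that adds roughly 300 ms of delay on average to a single Lambda handler, gated behind an environment variable so it can be switched on for only one function, might look like this:

```python
# Hypothetical latency-injection sketch (not any specific chaos tool's API):
# adds ~300 ms of delay on average to one Lambda handler, only when
# the CHAOS_LATENCY_ENABLED environment variable is set to "true".
import functools
import os
import random
import time

def inject_latency(mean_ms: float = 300.0, jitter_ms: float = 100.0):
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if os.environ.get("CHAOS_LATENCY_ENABLED") == "true":
                delay_ms = max(0.0, random.gauss(mean_ms, jitter_ms))
                time.sleep(delay_ms / 1000.0)  # simulate a slow dependency
            return handler(event, context)
        return wrapper
    return decorator

@inject_latency(mean_ms=300.0)
def lambda_handler(event, context):
    # ... normal business logic ...
    return {"statusCode": 200, "body": "ok"}
```

Gating the injection behind configuration is what keeps the blast radius small: only the functions you explicitly opt in are affected.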

You can ask big questions, but start experimenting on small parts. Samdan reminds you to inject failure only into a controlled piece of your system, such as injecting latency into a single function rather than the entire architecture. You want to maintain that smaller blast radius.

That’s also why you only run one experiment at a time. Then you can continue, injecting latency into two, three, four functions. He says you keep going until something breaks.
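The second hypothesis, graceful degradation when a DynamoDB table becomes unreachable, only holds if the code actually has a fallback path to degrade to. A minimal sketch, with a hypothetical table name and fallback payload, might look like:

```python
# Sketch of graceful degradation for the "DynamoDB unreachable" hypothesis.
# Table name and fallback payload are hypothetical placeholders.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")  # placeholder table name

FALLBACK_ORDER = {"status": "unknown", "source": "fallback-cache"}

def get_order(order_id: str) -> dict:
    """Return the order, or a degraded-but-valid response if DynamoDB is unreachable."""
    try:
        item = table.get_item(Key={"order_id": order_id}).get("Item")
        return item or FALLBACK_ORDER
    except (BotoCoreError, ClientError):
        # Degrade gracefully instead of failing the whole request.
        return FALLBACK_ORDER
```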

