Building an Automated Testing Framework Based on Chaos Mesh and Argo

Chaos Mesh® is an open-source chaos engineering platform for Kubernetes. Although it provides rich capabilities to simulate abnormal system conditions, it solves only a fraction of the chaos engineering puzzle. Besides fault injection, a full chaos engineering application consists of hypothesizing around defined steady states, running experiments in production, validating the system via test cases, and automating the testing.

This article describes how we use TiPocket, an automated testing framework, to build a full chaos engineering testing loop for TiDB, our distributed database.

Why Do We Need TiPocket?

Before we can put a distributed system like TiDB into production, we have to ensure that it is robust enough for day-to-day use. For this reason, several years ago we introduced Chaos Engineering into our testing framework. In our testing framework, we:

Observe the normal metrics and develop our testing hypothesis.
Inject a list of failures into TiDB.
Run various test cases to verify TiDB in fault scenarios.
Monitor and collect test results for analysis and diagnosis.

This sounds like a solid process, and we’ve used it for years. However, as TiDB evolves, the testing scale multiplies. We have multiple fault scenarios, against which dozens of test cases run in the Kubernetes testing cluster. Even with Chaos Mesh helping to inject failures, the remaining work can still be demanding, not to mention the challenge of automating the pipeline to make the testing scalable and efficient.

This is why we built TiPocket, a fully automated testing framework based on Kubernetes and Chaos Mesh. Currently, we mainly use it to test TiDB clusters. However, because of TiPocket’s Kubernetes-friendly design and extensible interface, you can use Kubernetes’ create and delete logic to easily support other applications.

How Does It Work?

Based on the above requirements, we need an automatic workflow that:

Injects chaos
Verifies the impact of that chaos
Automates the chaos pipeline
Visualizes the results

Injecting Chaos: Chaos Mesh

Fault injection is the core of chaos testing. In a distributed database, faults can happen anytime and anywhere: node crashes, network partitions, file system failures, and even kernel panics. This is where Chaos Mesh comes in.

Currently, TiPocket supports the following types of fault injection:

Network: Simulates network partitions and random packet loss, reordering, duplication, or delay on links.
Time skew: Simulates clock skew of the container to be tested.
Kill: Kills the specified pod, either randomly in a cluster or within a component (TiDB, TiKV, or Placement Driver (PD)).
I/O: Injects I/O delays in TiDB’s storage engine, TiKV, to identify I/O-related issues.

With fault injection handled, we need to think about verification. How do we make sure TiDB can survive these faults?

Verifying Chaos Impacts: Test Cases

To validate how TiDB withstands chaos, we implemented dozens of test cases in TiPocket, combined with a variety of inspection tools. To give you an overview of how TiPocket verifies TiDB in the event of failures, consider the following test cases. These cases focus on SQL execution, transaction consistency, and transaction isolation.

Fuzz Testing: SQLsmith

SQLsmith is a tool that generates random SQL queries. TiPocket creates a TiDB cluster and a MySQL instance, executes the random SQL generated by SQLsmith on both, and injects various faults into the TiDB cluster during the test. At the end, it compares the execution results. Any inconsistency points to a potential issue in our system.
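The comparison step can be sketched in a few lines of Go. This is only an illustration, not TiPocket's code: the Row type and the SameResults and normalize helpers are hypothetical names, and the sketch assumes each result set has already been fetched and rendered as strings. It compares rows as a multiset, because a query without ORDER BY may legitimately return rows in a different order on each database.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Row is one result row, with every column already rendered as a string.
// (Hypothetical type for this sketch.)
type Row []string

// normalize returns the rows as sorted, joined strings so that two result
// sets can be compared without depending on row order.
func normalize(rows []Row) []string {
	out := make([]string, len(rows))
	for i, r := range rows {
		// The unit separator keeps {"ab", "c"} distinct from {"a", "bc"}.
		out[i] = strings.Join(r, "\x1f")
	}
	sort.Strings(out)
	return out
}

// SameResults reports whether two result sets contain the same rows with the
// same multiplicity, ignoring order.
func SameResults(a, b []Row) bool {
	na, nb := normalize(a), normalize(b)
	if len(na) != len(nb) {
		return false
	}
	for i := range na {
		if na[i] != nb[i] {
			return false
		}
	}
	return true
}

func main() {
	// Hand-written stand-ins for the rows returned by each database.
	tidb := []Row{{"1", "alice"}, {"2", "bob"}}
	mysql := []Row{{"2", "bob"}, {"1", "alice"}}
	fmt.Println("results match:", SameResults(tidb, mysql))
}
```

A real harness would also have to decide how to treat queries that fail on one side only, and values whose textual rendering differs between the two databases.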

Transaction Consistency Testing: Bank and Porcupine

Bank is a classical test case that simulates the transfer process in a banking system. Under snapshot isolation, the total balance across all accounts must stay the same at every moment, even in the face of system failures. If the total ever changes, there are potential issues with our system.
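The invariant behind the bank test can be shown with a small in-memory sketch. It is not TiPocket's implementation: the Bank type below is a hypothetical stand-in for an accounts table, and a mutex plays the role that transactions play in a real database. Concurrent workers move money between random accounts, and after every transfer a verifier re-reads the total and checks that it never changes.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// Bank is an in-memory stand-in for the accounts table (hypothetical type).
type Bank struct {
	mu       sync.Mutex
	balances []int
}

// NewBank creates the given number of accounts, each with the same balance.
func NewBank(accounts, initial int) *Bank {
	b := &Bank{balances: make([]int, accounts)}
	for i := range b.balances {
		b.balances[i] = initial
	}
	return b
}

// Transfer atomically moves amount between two accounts. It refuses
// self-transfers and transfers that would overdraw the source account.
func (b *Bank) Transfer(from, to, amount int) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if from == to || b.balances[from] < amount {
		return false
	}
	b.balances[from] -= amount
	b.balances[to] += amount
	return true
}

// Total sums every balance in one consistent snapshot.
func (b *Bank) Total() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	sum := 0
	for _, v := range b.balances {
		sum += v
	}
	return sum
}

func main() {
	const accounts, initial = 10, 1000
	bank := NewBank(accounts, initial)
	want := accounts * initial

	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			r := rand.New(rand.NewSource(seed))
			for i := 0; i < 1000; i++ {
				bank.Transfer(r.Intn(accounts), r.Intn(accounts), r.Intn(100))
				// Verifier step: the total must be unchanged at every moment.
				if got := bank.Total(); got != want {
					panic(fmt.Sprintf("invariant violated: total=%d, want=%d", got, want))
				}
			}
		}(int64(w))
	}
	wg.Wait()
	fmt.Println("total preserved:", bank.Total() == want)
}
```

In a test against a real database, Transfer and Total would each run as a transaction, and faults would be injected while the workers are running.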

Porcupine is a linearizability checker in Go built to test the correctness of distributed systems. It takes a sequential specification as executable Go code, along with a concurrent history, and it determines whether the history is linearizable with respect to the sequential specification. In TiPocket, we use the Porcupine checker in multiple test cases to check whether TiDB meets the linearizability constraint.
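To make the idea of checking a concurrent history against a sequential specification concrete, here is a naive brute-force checker for a single register. It does not use Porcupine's API or its algorithm, and the Op type and function names are hypothetical. It simply searches for an ordering of the operations that respects real-time order (an operation that returned before another was called must come first) and is legal for a register.

```go
package main

import "fmt"

// Op is one completed operation on a single register (hypothetical type).
type Op struct {
	Call, Return int  // invocation and response timestamps
	Write        bool // true for a write, false for a read
	Value        int  // value written, or value observed by the read
}

// Linearizable reports whether some total order of ops both respects
// real-time order and is legal for a register that starts at init.
func Linearizable(ops []Op, init int) bool {
	done := make([]bool, len(ops))
	return search(ops, done, len(ops), init)
}

// search tries every operation that may legally be linearized next and
// backtracks when a read does not match the register state.
func search(ops []Op, done []bool, remaining, state int) bool {
	if remaining == 0 {
		return true
	}
	for i, op := range ops {
		if done[i] || !minimal(ops, done, i) {
			continue
		}
		next := state
		if op.Write {
			next = op.Value
		} else if op.Value != state {
			continue // this read could not have taken effect here
		}
		done[i] = true
		if search(ops, done, remaining-1, next) {
			return true
		}
		done[i] = false
	}
	return false
}

// minimal reports whether op i may go next: no other pending operation
// returned before op i was called.
func minimal(ops []Op, done []bool, i int) bool {
	for j, other := range ops {
		if j != i && !done[j] && other.Return < ops[i].Call {
			return false
		}
	}
	return true
}

func main() {
	// A read that overlaps a write may observe the new value.
	overlapping := []Op{{Call: 0, Return: 10, Write: true, Value: 1}, {Call: 2, Return: 5, Value: 1}}
	// A read that starts after the write finished must not see the old value.
	stale := []Op{{Call: 0, Return: 1, Write: true, Value: 1}, {Call: 2, Return: 3, Value: 0}}
	fmt.Println(Linearizable(overlapping, 0), Linearizable(stale, 0))
}
```

The search is exponential in the worst case, which is why a purpose-built checker is used for real histories.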

Transaction Isolation Testing: Elle

Elle is an inspection tool that verifies a database’s transaction isolation level. TiPocket integrates go-elle, the Go implementation of the Elle inspection tool, to verify TiDB’s isolation level.
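Elle works by inferring dependencies between transactions from the observed history and reporting cycles in the resulting graph as isolation anomalies. The sketch below shows only that final cycle-detection step on a hand-built graph. It is not go-elle's API: the Graph type and HasCycle function are hypothetical, and building the dependency graph from a real history is the hard part that this example leaves out.

```go
package main

import "fmt"

// Graph maps a transaction ID to the transactions that must come after it,
// for example because they read or overwrote a value it wrote.
// (Hypothetical type for this sketch.)
type Graph map[int][]int

// HasCycle reports whether the dependency graph contains a cycle, using a
// depth-first search with three-color marking.
func HasCycle(g Graph) bool {
	const (
		white = iota // not visited yet
		gray         // on the current DFS path
		black        // fully explored
	)
	color := map[int]int{}
	var visit func(n int) bool
	visit = func(n int) bool {
		color[n] = gray
		for _, m := range g[n] {
			if color[m] == gray {
				return true // back edge: m is already on the current path
			}
			if color[m] == white && visit(m) {
				return true
			}
		}
		color[n] = black
		return false
	}
	for n := range g {
		if color[n] == white && visit(n) {
			return true
		}
	}
	return false
}

func main() {
	// T1 -> T2 -> T3 can be explained by a serial order.
	ordered := Graph{1: {2}, 2: {3}}
	// T1 -> T2 -> T1 cannot: each transaction would have to precede the other.
	anomalous := Graph{1: {2}, 2: {1}}
	fmt.Println(HasCycle(ordered), HasCycle(anomalous))
}
```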

These are just a few of the test cases TiPocket uses to verify TiDB’s accuracy and stability. For more test cases and verification methods, see our source code.
