Chaos Mesh ® is an open-source chaos engineering platform for Kubernetes. Although it provides rich capabilities to simulate abnormal system conditions, it still only solves a fraction of the Chaos Engineering puzzle. Besides fault injection, a full chaos engineering application consists of hypothesizing around defined steady states, running experiments in production, validating the system via test cases, and automating the testing.
This article describes how we use TiPocket, an automated testing framework to build a full Chaos Engineering testing loop for TiDB, our distributed database.
Before we can put a distributed system like TiDB into production, we have to ensure that it is robust enough for day-to-day use. For this reason, several years ago we introduced Chaos Engineering into our testing framework. In our testing framework, we:
This sounds like a solid process, and we’ve used it for years. However, as TiDB evolves, the testing scale multiplies. We have multiple fault scenarios, against which dozens of test cases run in the Kubernetes testing cluster. Even with Chaos Mesh helping to inject failures, the remaining work can still be demanding-not to mention the challenge of automating the pipeline to make the testing scalable and efficient.
This is why we built TiPocket, a fully-automated testing framework based on Kubernetes and Chaos Mesh. Currently, we mainly use it to test TiDB clusters. However, because of TiPocket’s Kubernetes-friendly design and extensible interface, you can use Kubernetes’ create and delete logic to easily support other applications.
Based on the above requirements, we need an automatic workflow that:
Fault injection is the core chaos testing. In a distributed database, faults can happen anytime, anywhere-from node crashes, network partitions, and file system failures, to kernel panics. This is where Chaos Mesh comes in.
Currently, TiPocket supports the following types of fault injection:
With fault injection handled, we need to think about verification. How do we make sure TiDB can survive these faults?
To validate how TiDB withstands chaos, we implemented dozens of test cases in TiPocket, combined with a variety of inspection tools. To give you an overview of how TiPocket verifies TiDB in the event of failures, consider the following test cases. These cases focus on SQL execution, transaction consistency, and transaction isolation.
SQLsmith is a tool that generates random SQL queries. TiPocket creates a TiDB cluster and a MySQL instance. The random SQL generated by SQLsmith is executed on TiDB and MySQL, and various faults are injected into the TiDB cluster to test. In the end, execution results are compared. If we detect inconsistencies, there are potential issues with our system.
Bank is a classical test case that simulates the transfer process in a banking system. Under snapshot isolation, all transfers must ensure that the total amount of all accounts must be consistent at every moment, even in the face of system failures. If there are inconsistencies in the total amount, there are potential issues with our system.
Porcupine is a linearizability checker in Go built to test the correctness of distributed systems. It takes a sequential specification as executable Go code, along with a concurrent history, and it determines whether the history is linearizable concerning the sequential specification. In TiPocket, we use the Porcupine checker in multiple test cases to check whether TiDB meets the linearizability constraint.
Elle is an inspection tool that verifies a database’s transaction isolation level. TiPocket integrates go-elle, the Go implementation of the Elle inspection tool, to verify TiDB’s isolation level.
These are just a few of the test cases TiPocket uses to verify TiDB’s accuracy and stability. For more test cases and verification methods, see our source code.
#tutorial #performance #cloud native #distributed system #chaos engineering #database