Modern computing systems are complex and constantly changing, a direct result of adopting cloud native technologies and distributed system designs. These technologies and designs can save money and add resilience through automated responses to changing system conditions, but they also introduce failure modes that are hard to predict. Gartner’s recently released report, Innovation Insight for Chaos Engineering, highlights this, beginning by comparing today’s systems to pinball machines. From there, the report moves into the practical aspects of implementing and using Chaos Engineering to identify systemic failure modes and improve reliability.

We know that not everyone has access to Gartner research. We’ve been given permission to pull out key quotes, starting with the pinball analogy, that help clarify concepts around Chaos Engineering and reliability. Throughout this article, all text in italics is taken directly from the report.

Chaos Engineering and Predictability

Pinball machines have existed since the 1800s, first as highly mechanical games, later coin-operated and in the 1930s the games became electrified. The pinball games themselves frequently have a hero character or theme, and frequently there are goals or regions for scoring within the playboard that follows the character’s motive. This all sounds simple, yet you will never have the exact same experience when it comes to operation and play two games in a row. In this way, we can easily consider the pinball machine as a complex and deterministic chaos system.

Why is it then we expect digital systems with ownership and complexity far beyond a nostalgic game to provide such consistent experiences? We are working with deterministic (diagrammed and documented) and chaotic (unpredictable) systems, and we need to test them as such. Collaboration and teamwork will be key in the success of this endeavor. It is very likely that neither the development team nor the operations team will have total understanding of code constructs, application and system dependencies, resilient architectures, monitoring and automated remediation technologies. These are all inputs and parameters for consideration when crafting the attack plan.

Chaos Engineering is necessary because modern systems are chaotic and unpredictable. With services and nodes appearing and disappearing according to system load, we can never state precisely what our architecture contains at any given moment. We can estimate, but not with certainty.

The pinball analogy works because you never play the exact same game twice. You can learn how to use the flippers more effectively, aiming your shots and scoring bonuses, but even good pinball players will confirm that there are no guarantees. Sometimes the ball slips between the flippers and sometimes the entire machine lights up and makes fantastic noises, even when you thought you acted exactly the same way.
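
To make the idea of a system that is both deterministic and chaotic more concrete, here is a minimal Python sketch. It uses the logistic map, a standard textbook example of deterministic chaos. It is not taken from the Gartner report, and the function name, starting values, and step count are all assumptions chosen for illustration. Think of the starting value as how hard you pull the plunger: the rule for each step is fixed, yet a difference far too small to notice eventually produces a completely different game.

```python
# Illustrative sketch only: the logistic map stands in for the pinball machine.
# Nothing here comes from the Gartner report; all names and values are assumptions.

def logistic_trajectory(x0: float, steps: int = 100, r: float = 4.0) -> list[float]:
    """Return the sequence x0, x1, ..., where x_{n+1} = r * x_n * (1 - x_n)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two "plunger pulls" that differ by one part in ten billion.
game_a = logistic_trajectory(0.2)
game_b = logistic_trajectory(0.2 + 1e-10)

# Deterministic: replaying exactly the same input gives exactly the same game.
replay_a = logistic_trajectory(0.2)

# Chaotic: the tiny initial difference grows until the two games are unrelated.
gaps = [abs(a - b) for a, b in zip(game_a, game_b)]

if __name__ == "__main__":
    print("identical replay:", game_a == replay_a)
    print("gap after 5 steps:", gaps[5])
    print("largest gap over 100 steps:", max(gaps))
```

The early gap stays tiny, which is why the two games look identical at first. The later gap becomes as large as the values themselves, which is the behavior the pinball analogy describes: the rules are fully known, but the outcome cannot be predicted without testing.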


How Pinball Machines Highlight the Importance of Chaos Engineering