_In this podcast, Ana Medina, senior chaos engineer at Gremlin, sat down with InfoQ podcast co-host Daniel Bryant. Topics discussed included: how enterprise organisations are adopting chaos engineering, including their requirements for guardrails and the need for “status checks” that verify system health before an experiment begins; how to run game days or IT fire drills when everyone is working remotely; and why teams should continually invest in learning from past incidents and preparing for inevitable failures within systems._

Key Takeaways

  • Enterprise organisations want to implement “guardrails” before embracing chaos engineering. Critical capabilities include being able to rapidly terminate a chaos experiment if a production system is being unexpectedly impacted, and also running “pre-flight” status checks to verify that the system (and surrounding ecosystem) is healthy.
  • The global pandemic has undeniably impacted disaster recovery and business continuity plans and training. However, it is still possible to run game days or IT fire drills in a distributed working environment.
  • All software delivery personas will benefit from understanding more about disaster recovery and how to design resilient systems. As more teams build complex distributed systems, it is vitally important to encourage software architects and developers to learn more about this topic.
  • Much can be learned from analysing past incidents and near misses in production systems. A rich community is forming around these ideas in software development, inspired by lessons from other disciplines.
  • To minimise chances of user-facing failure during important operational events or business dates, such as sales or holiday events, organisations should generally start planning 3-6 months out. This time allows an organisation to update service level objectives (SLOs), update runbooks, conduct fire drills, add external capacity, and modify on-call rotations.
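
The guardrails described in the first takeaway can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Gremlin's API or any particular tool's implementation: `run_with_guardrails`, `ExperimentResult`, and every callback name below are hypothetical. The sketch runs the “pre-flight” status checks first, refuses to start if any of them fails, and halts and rolls back as soon as a health check fails mid-experiment.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    started: bool         # False if a pre-flight check blocked the experiment
    halted_early: bool    # True if the health check stopped it mid-run
    steps_completed: int  # number of fault-injection steps that ran
    reason: str


def run_with_guardrails(
    preflight_checks: List[Callable[[], bool]],
    inject_step: Callable[[int], None],
    is_healthy: Callable[[], bool],
    rollback: Callable[[], None],
    max_steps: int,
) -> ExperimentResult:
    """Hypothetical guardrail wrapper around a chaos experiment."""
    # Pre-flight: do not inject any fault unless every status check passes.
    for check in preflight_checks:
        if not check():
            name = getattr(check, "__name__", repr(check))
            return ExperimentResult(False, False, 0, f"pre-flight check failed: {name}")

    completed = 0
    try:
        for step in range(max_steps):
            inject_step(step)
            completed += 1
            # Halt condition: stop as soon as the system looks unhealthy.
            if not is_healthy():
                return ExperimentResult(True, True, completed, "health check failed; halted")
        return ExperimentResult(True, False, completed, "completed")
    finally:
        # Always undo the injected fault once the experiment has started.
        rollback()


# Illustrative run with made-up error rates and a made-up 5% threshold.
error_rates = iter([0.01, 0.02, 0.30])
log: List[str] = []
result = run_with_guardrails(
    preflight_checks=[lambda: True],
    inject_step=lambda step: log.append(f"inject {step}"),
    is_healthy=lambda: next(error_rates) < 0.05,
    rollback=lambda: log.append("rollback"),
    max_steps=5,
)
print(result)
print(log)
```

Placing the rollback in a `finally` clause means the injected fault is reverted whether the experiment completes, is halted by the health check, or raises an exception, while a failed pre-flight check returns before anything is injected.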


Ana Medina on Chaos Engineering, Game Days, and Learning