How do you fail over without falling over? Uptime and reliability are at the core of chaos engineering, the art and science of rooting out your systems’ weaknesses. It’s all about increasing the certainty that your backups and your backups’ backups are going to work.

At this year’s virtual ChaosConf, Adrian Cockcroft, vice president of cloud architecture strategy at Amazon Web Services, talked about the dangers of “availability theater” and how to better ground your system’s reliability in reality. He started by asking whether audience members even have a backup data center and whether they have ever tested its failover reliability.

“If you have a backup data center but you never failed over to it and are not confident to failover to it in a moment’s notice, you invested a lot of money for a façade of availability,” he said.

In recent years, known outages have been more likely to be caused by IT and network problems than by power issues. Cockcroft quoted the 1984 book “Normal Accidents” on complex systems having multilayered failures that are “unexpected, incomprehensible, uncontrollable and unavoidable.”

Like natural disasters, these outages may be unavoidable, but you can still do everything in your power to prepare for them. Today we will share Cockcroft’s advice for continuously tested resilience.


Adrian Cockcroft on ‘Failover Theater’ and Achieving True Continuous Resilience