You’re sound asleep when the alarms go off. It’s 3 a.m. You wipe your eyes, check your phone. You know something is wrong. Very wrong.

The website is down. Your application is broken. The only light in the room is coming from your computer monitor. The gremlin in the system could be hiding anywhere, and it’s your team’s job to find it.

And fix what’s broken, fast.

As someone who runs PR for various DevOps startups, I’ve seen this story play out over and over. The reputational cost of a major outage alone is enough to instill fear in even the most seasoned engineer!

But the truth is, every company experiences system failures. And we’re still a ways off from online systems working like utilities, where you flip a switch and it just works. So sharing stories and normalizing failure (e.g. transparent and blameless postmortems) are positive trends for the industry; they make everyone feel less scared and alone.

I’m not going to cite generic numbers about the cost of downtime. For Amazon it may be millions per hour; for your company, it may be confined to a frustrating customer experience, if dealt with swiftly. But ultimately these kinds of situations lose businesses money, hurt reputations, drain engineering resources and fuel interest in the competition.

So in the spirit of Halloween, and more importantly in the spirit of sharing experiences so these failures are less likely to happen again, let’s take a look at six scary outage stories, as told by CTOs themselves.
