3 Lessons DevOps Can Learn From 5 Biggest Outages of Q2 2020

Nobody is immune from outages, but it’s better to learn from others’ mistakes than from your own. The second quarter of 2020 was marked by several serious outages at prominent services, including IBM Cloud, GitHub, Slack, Zoom and even T-Mobile (Source: StatusGator Report). I’m sure you noticed these outages, as our team did. I decided to share the lessons we learned from this downtime, hoping we can all grow from it.

Lesson 1: Don’t Host Your Status Page on Your Own Infrastructure – Outage of IBM Cloud

A status page helps you communicate with clients and keep them informed about changes and incidents. It serves both clients and internal teams, and it can reduce support tickets because users can see what has happened without contacting you. Put simply, a status page is a convenient, efficient and necessary communication tool. But it becomes useless if you host it on your own infrastructure. Host your status page on a separate domain, served from infrastructure that does not depend on your own.

On June 10, 2020, IBM Cloud had an outage that affected its core cloud services: Kubernetes Service, Cloud Object Storage, VPN for VPC, Identity and Access Management (IAM), Continuous Delivery, App Connect, Watson AI and… their status pages. The status page was reachable in the early stages of the outage, but only intermittently after that. Overall, many users criticized IBM on social media for a lack of transparency and communication. This leads to the first conclusion.

Hosting the status page on your own infrastructure can hurt your reputation, because users are left with a negative impression when it fails. It also defeats the purpose: during downtime, the status page will be unavailable, just like the rest of your services.
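
A separate domain is only part of the answer: if the status domain resolves to the same servers as production, it will go down with them. Below is a minimal sketch of one quick check, comparing the IP addresses that production hosts and the status host resolve to. The hostnames are hypothetical, and this check only catches shared servers, not a shared DNS provider, cloud account or network.

```python
import socket
from collections.abc import Callable, Iterable


def resolve(hostname: str) -> set[str]:
    """Return the set of IP addresses a hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, 443, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def shared_addresses(
    prod_hosts: Iterable[str],
    status_host: str,
    resolver: Callable[[str], set[str]] = resolve,
) -> set[str]:
    """Return IPs used by both production and the status page (ideally empty)."""
    prod_addrs: set[str] = set()
    for host in prod_hosts:
        prod_addrs |= resolver(host)
    # Any overlap means a single server failure can take down both.
    return prod_addrs & resolver(status_host)
```

For example, `shared_addresses(["app.example.com", "api.example.com"], "status.example.com")` should return an empty set; any address it does return belongs to a server whose failure would take the status page down together with production. The `resolver` parameter lets you run the comparison without network access.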

*What Should IBM Do Now?*

Set up a status page that does not depend on their own infrastructure.

They have three options:

  1. StatusPage.io. The largest and most popular provider, and probably one of the few status page services big enough to handle a company like IBM Cloud.
  2. Status.io. They are also very large and popular, and most likely able to offer the kind of scale that IBM Cloud would require. 
  3. Build their own, taking care not to depend on any of their own infrastructure. For example, they could host it as a static page on a third-party CDN to reduce complexity and dependency on their network.
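
Option 3 can be as simple as rendering a self-contained HTML file and uploading it to a CDN operated by someone else. Here is a minimal sketch, assuming hypothetical component names and three hypothetical states (`operational`, `degraded`, `outage`) supplied by monitoring that also runs outside your own infrastructure:

```python
from datetime import datetime, timezone
from html import escape

# Human-readable labels for each component state (hypothetical state names).
LABELS = {
    "operational": "Operational",
    "degraded": "Degraded performance",
    "outage": "Major outage",
}


def render_status_page(components: dict[str, str], updated: datetime) -> str:
    """Render a self-contained HTML status page with no external assets."""
    rows = "\n".join(
        f"<li>{escape(name)}: {escape(LABELS.get(state, state))}</li>"
        for name, state in components.items()
    )
    if all(state == "operational" for state in components.values()):
        headline = "All systems operational"
    else:
        headline = "Some systems are experiencing issues"
    stamp = updated.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Status</title></head><body>\n'
        f"<h1>{headline}</h1>\n"
        f"<ul>\n{rows}\n</ul>\n"
        f"<p>Last updated: {stamp}</p>\n"
        "</body></html>\n"
    )


if __name__ == "__main__":
    # Hypothetical component states; in practice, feed these from external monitoring.
    example = {"Kubernetes Service": "operational", "IAM": "outage"}
    print(render_status_page(example, datetime.now(timezone.utc)))
```

Write the returned string to `index.html` and publish it with your CDN provider's own upload tooling (not shown, since it is provider-specific). Because the page loads no scripts, stylesheets or APIs, it stays readable even when every backend service is down.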

If you are smaller than IBM, take a look at Cachet. It is an open-source status page tool that works well and is easy to deploy to many hosting providers, such as DigitalOcean.
