On March 15, 2021, Microsoft Teams, alongside many other services, experienced a global outage.

Microsoft has released a root cause analysis (RCA) of the incident. In this video we will summarize what caused the outage and what Microsoft did to resolve it.

Microsoft services rely on Azure Active Directory for authentication and authorization.

Each service receives a token and verifies it against a signing key 🔑 to make sure the token is still valid. As part of automated security hygiene, Microsoft rotates keys and invalidates keys that are no longer in use.
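To make the verification step concrete, here is a minimal sketch of checking a token against a set of trusted signing keys. It is an illustration only, not how Azure Active Directory is implemented: the names `TRUSTED_KEYS`, `sign_token`, `verify_token` and `rotate_out` are hypothetical, and HMAC with shared secrets stands in for the asymmetric signatures a real identity provider would use, purely so the example runs with the standard library.

```python
import hashlib
import hmac

# Hypothetical key set: key id -> secret. Assumption for illustration only;
# a real identity provider publishes public keys, not shared secrets.
TRUSTED_KEYS = {"key-1": b"secret-one", "key-2": b"secret-two"}


def sign_token(payload: str, key_id: str, keys: dict) -> str:
    """Build a token of the form '<key id>.<payload>.<signature>'."""
    sig = hmac.new(keys[key_id], payload.encode(), hashlib.sha256).hexdigest()
    return f"{key_id}.{payload}.{sig}"


def verify_token(token: str, keys: dict) -> bool:
    """Accept the token only if its signing key is still in the trusted set."""
    key_id, rest = token.split(".", 1)
    payload, sig = rest.rsplit(".", 1)
    secret = keys.get(key_id)
    if secret is None:
        return False  # the key is unknown or has been removed
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)


def rotate_out(keys: dict, retired_key_id: str) -> dict:
    """Return a new key set without the retired key."""
    return {k: v for k, v in keys.items() if k != retired_key_id}


token = sign_token("user=alice", "key-1", TRUSTED_KEYS)
print(verify_token(token, TRUSTED_KEYS))                       # True
print(verify_token(token, rotate_out(TRUSTED_KEYS, "key-1")))  # False
```

The second call shows why removing a key matters: every token signed with that key stops verifying at once, even though the tokens themselves have not changed.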

A bug in the automated key rotation removed a signing key that was not supposed to be removed. Unfortunately, this key had signed a large number of tokens that were still being used by many services.

After the removal, services downloaded the updated key metadata, which no longer listed that key, and marked every token signed with it as invalid (the key was no longer trusted).

Users connecting to these services started to get errors because of this.

Microsoft engineering quickly realized this and reverted the metadata so that the key would be trusted again.

However, because each service had already cached the knowledge that the key was untrusted, it would not pick up the reverted metadata (cache invalidation is one of the most difficult problems).
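The stale-cache effect can be sketched in a few lines. This is a simplified model under stated assumptions, not the actual client behavior: the class `KeyMetadataCache`, the one-hour TTL, and the `published` set standing in for the key metadata are all hypothetical.

```python
class KeyMetadataCache:
    """Hypothetical client-side cache of trusted key ids with a fixed TTL."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch          # callable returning the current key ids
        self._ttl = ttl_seconds
        self._keys = None
        self._fetched_at = None

    def trusted_keys(self, now):
        # Only go back to the source when nothing is cached or the TTL expired.
        if self._keys is None or now - self._fetched_at >= self._ttl:
            self._keys = set(self._fetch())
            self._fetched_at = now
        return self._keys


# Stand-in for the published key metadata (assumption for illustration).
published = {"key-1", "key-2"}


def fetch_published():
    return set(published)


cache = KeyMetadataCache(fetch_published, ttl_seconds=3600)

published.discard("key-1")                       # the key is removed by mistake
print("key-1" in cache.trusted_keys(now=0))      # False: bad metadata is cached

published.add("key-1")                           # the metadata is reverted
print("key-1" in cache.trusted_keys(now=60))     # False: cache is still stale
print("key-1" in cache.trusted_keys(now=3600))   # True: TTL expired, refetched
```

In this model, the revert at the source has no effect on a client until that client's cached copy expires, which is why rolling back the metadata alone did not bring services back.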

That exacerbated the problem: some services went down, while others continued to distrust those tokens.

Engineers finally pushed a fix that forced services to refresh the key metadata and trust the key again.
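One way to picture such a fix is a cache that can be told to bypass its own expiry. The sketch below is only an assumption about the general shape of a forced refresh; the RCA summary here does not describe the mechanism Microsoft used, and `RefreshableKeyCache` and its `force_refresh` flag are hypothetical names.

```python
class RefreshableKeyCache:
    """Hypothetical key-metadata cache that supports a forced refresh."""

    def __init__(self, fetch, ttl_seconds):
        self._fetch = fetch          # callable returning the current key ids
        self._ttl = ttl_seconds
        self._keys = None
        self._fetched_at = None

    def trusted_keys(self, now, force_refresh=False):
        expired = self._keys is None or now - self._fetched_at >= self._ttl
        if force_refresh or expired:
            # A forced refresh ignores the TTL and refetches immediately.
            self._keys = set(self._fetch())
            self._fetched_at = now
        return self._keys


# Stand-in for published key metadata with the key wrongly removed.
metadata = {"key-2"}
key_cache = RefreshableKeyCache(lambda: set(metadata), ttl_seconds=3600)

print("key-1" in key_cache.trusted_keys(now=0))    # False: key is missing
metadata.add("key-1")                              # metadata is reverted
print("key-1" in key_cache.trusted_keys(now=60))   # False: stale cache
print("key-1" in key_cache.trusted_keys(now=61, force_refresh=True))  # True
```

The forced path trades a burst of extra fetches for immediate consistency, which is a reasonable exchange during an incident.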

This is when the services started coming back to normal.

RCA
https://status.azure.com/en-us/status/history/

Slides for this video
https://payhip.com/b/RoCa

#developer

What Caused Microsoft Teams and Other Services to Go Down on March 15 2021?