How we define a “service” for our BCM program

If you ask three people what a service is, you may get three different answers. At Microsoft, we define a service (business process or technology) as a means of delivering value to customers (first- or third-party) by facilitating outcomes customers want to achieve.

To ensure the highest level of resiliency for each of our “services,” we include three elements:

  • People: The people who are responsible for providing the service.
  • Process: The methodology used to provide the service.
  • Technology: The tools used to deliver the service or the technology itself delivered as the value.

Customers see our services as product offerings that are composed of various bundled services. Each individual service is mapped in our inventory and run through the BCM program to ensure that its people, processes, and technologies are resilient to a variety of failures.

Our end-to-end program identifies, prioritizes, maps, and tests every service, going beyond “box checking” compliance. We focus on a broad understanding of how to provide the best service to our customers, who demand reliable service offerings for their business.

How the BCM program is managed in practice

Through a sophisticated set of tooling, every service (both internal- and external-facing) is uniquely mapped and shared with a chain of compliance tools addressing privacy, security, BCM, and more. This ensures that every service carries shareable metadata for other tools, regardless of type or criticality.

In the context of this post, records are automatically ported to our BCM management tool. Once there, they are automatically scoped for disaster recovery (DR) requirements that meet certifiable standards and deliver on our customer promises. These records contain the most familiar elements of a BCM program, including business impact analysis, dependencies, workforce, suppliers, recovery plans, and tests. In addition, the records provide insight into potential customer impacts, detection capabilities, and willingness to fail over.
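To make the shape of such a record concrete, here is a minimal sketch in Python. The class name, field names, and sample values are all hypothetical and chosen only to mirror the elements listed above; they do not reflect the actual schema of our BCM management tool.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BcmRecord:
    """Hypothetical BCM record; field names are illustrative, not a real schema."""

    service_name: str
    business_impact_analysis: str = ""
    dependencies: list[str] = field(default_factory=list)
    workforce: list[str] = field(default_factory=list)
    suppliers: list[str] = field(default_factory=list)
    recovery_plan: str = ""
    tests: list[str] = field(default_factory=list)


# The elements a record is expected to carry, in the order named in the text.
REQUIRED_ELEMENTS = [
    "business_impact_analysis",
    "dependencies",
    "workforce",
    "suppliers",
    "recovery_plan",
    "tests",
]


def missing_elements(record: BcmRecord) -> list[str]:
    """Return the names of required elements that are still empty."""
    return [name for name in REQUIRED_ELEMENTS if not getattr(record, name)]


# Example with made-up values: three elements have not been filled in yet.
record = BcmRecord(
    service_name="example-storage-frontend",
    business_impact_analysis="Customer-facing, high impact",
    dependencies=["example-dns"],
    recovery_plan="Fail over to paired region",
)
print(missing_elements(record))  # ['workforce', 'suppliers', 'tests']
```

A completeness check like this is only one possible use of the record; the same structure could feed dependency mapping or test scheduling.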

Testing recoverability

No amount of tooling, policies, or documents can provide the same level of confidence in service recovery and sustainability as comprehensive testing. Azure services test at various levels, ranging from individual unit tests all the way to complete “region down” scenarios. Every service must show proof of testing and demonstrate that its recovery meets its stated goals, both the internal targets and what we guarantee to our end customers in the Service Level Agreements (SLAs). Tabletop testing, in which simulated emergencies are merely discussed, is not considered acceptable or compliant for our program.
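The rule described above (a real test is required, and measured recovery must meet the stated goal) can be sketched as a simple check. This is an illustration under assumptions: the `RecoveryTest` type, the test-type labels, and the minute-based targets are hypothetical and are not taken from our tooling.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTest:
    """Hypothetical result of one recovery test for one service."""

    service_name: str
    test_type: str  # assumed labels: "unit", "failover", "region_down", "tabletop"
    measured_recovery_minutes: float
    target_recovery_minutes: float  # the service's stated recovery goal


def is_compliant(test: RecoveryTest) -> bool:
    """A test counts only if it was actually executed and met the stated goal."""
    if test.test_type == "tabletop":
        # Discussion-only exercises are not accepted as proof of recovery.
        return False
    return test.measured_recovery_minutes <= test.target_recovery_minutes


# Made-up results for illustration.
results = [
    RecoveryTest("example-api", "region_down", 42.0, 60.0),
    RecoveryTest("example-queue", "failover", 75.0, 60.0),
    RecoveryTest("example-portal", "tabletop", 0.0, 60.0),
]
for result in results:
    print(result.service_name, is_compliant(result))
```

In this sketch only the first service passes: the second missed its target, and the third offered only a tabletop exercise.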

Our most robust integrated testing takes place in our “Canary” environment, which consists of two distinct production datacenter regions: one in the Eastern United States and the other in the Central United States.

On a regular basis, we test service recovery with a complete zone or region shutdown (simulating a major production outage or catastrophic loss), forcing all services to invoke their recovery plans. These tests not only verify service recoverability, but also test our incident response team’s processes for managing major incidents. For Availability Zones, we test and verify the seamless continuation of service availability in the face of an entire zone loss. These are end-to-end tests that include detection, response, coordination, and recovery.

All processes, from detection to response and action, are performed as if this were a real service-impacting event. Service responders are the normal on-call engineers. We also test synthetic customer-responsible functions, such as virtual machine (VM) failover to paired regions, ensuring customer workloads can operate in large-scale failure scenarios.

