Can We Trust the Cloud Not to Fail?

  • It is crucial for technical decision-makers to understand the specifics of what’s at the core of systems and how they work to provide the promised guarantees.
  • Failure detectors are one of the essential techniques to discover failures in a distributed system. There are different types of failure detectors offering different guarantees, depending on their completeness and accuracy.
  • In practical real-world systems, a lot of failure detectors are implemented using heartbeats and timeouts.
  • To achieve at least weak accuracy, the timeout value should be chosen so that correct nodes aren’t falsely suspected. This can be done adaptively, by increasing the timeout value after each false suspicion.
  • Service Fabric is one example of a system implementing failure detection - it uses a lease mechanism similar to heartbeats. Azure Cosmos DB relies on Service Fabric.

In short, we can’t trust the cloud to never fail. There will always be some underlying components that fail, restart, or go offline. On the other hand, does it matter that something went wrong if all the workloads are still running successfully? Most likely, we’ll be okay with it.

We are used to talking about reliability at a high level, in terms of uptime figures that back guarantees for availability or fault tolerance. This is often enough for most decision makers when choosing a technology. But how does the cloud actually provide this availability? I will explore this question in detail in this article.

Let’s get to the very core of it. What causes unavailability? Failures - machine failures, software failures, network failures - the reality of distributed systems. Failures, and our inability to handle them. So how do we detect and handle failures? Decades of research and experimentation have shaped the way we approach them in modern cloud systems.

I will start with the theory behind failure detection, and then review a couple of real-world examples of how the mechanism works in a real cloud - on Azure. Even though this article includes real-world applications of failure detection within Azure, the same notions could also apply to GCP, AWS, or any other distributed system.

Why should you care?

This is interesting, but why should you care? Customers don’t care how exactly things are implemented under the hood; all they want is for their systems to be up and running. The industry is indeed moving towards creating abstractions and making it much easier for engineers to work with technologies, ultimately focusing on what needs to be done to solve business problems. As Corey Quinn wrote in his recent article:

"I care about price, and I care about performance. But if you’re checking both of those boxes, then I don’t particularly care whether you’ve replaced the processors themselves with very fast elves with backgrounds in accounting."

This is true for the absolute majority of the end users.

For technical engineering leaders and decision makers, it can be crucial to understand the specifics of what’s at the core of a system and how it works to provide the promised guarantees. Transparency around the internals gives better insight into the further development of such systems and their future prospects, which is valuable for long-term decisions and alignment. I gave a keynote talk at O’Reilly Velocity about why this is true, if you are curious to learn more, or you can read a summary here.

Theoretical Tale of Failure Detectors

Unreliable Failure Detectors For Reliable Distributed Systems

The paper by Chandra and Toueg has been groundbreaking for distributed systems research and remains a useful source of information on the topic, which I highly recommend reading.

Failure detectors

Failure detectors are one of the essential techniques for discovering node crashes or failures in a cluster in a distributed system. They help processes in a distributed system change their action plan when they face such failures.

For example, a failure detector can help a coordinator node avoid a doomed attempt to request data from a node that crashed. With failure detectors, each node can know whether any other node in the cluster has crashed. Having this information, each node can decide what to do in case of detected failures of other nodes. For example, instead of contacting the main data node, the coordinator could contact one of its healthy secondary replicas and serve the response to the client from there.

Failure detectors don’t guarantee the successful execution of client requests. They help nodes in the system become aware of known crashes of other nodes and avoid continuing down a path of failure. Failure detectors collect and provide information about node failures; it’s up to the distributed system’s logic to decide how to use it. If the data is stored redundantly across several nodes, the coordinator can choose to contact alternative nodes to execute the request. In other cases, a failure might affect enough replicas that the client request isn’t guaranteed to succeed.
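To make the separation of concerns above concrete, here is a minimal sketch in Python of a failure detector used for request routing. The names (`FailureDetector`, `route_request`, the node labels) are hypothetical, not any specific system’s API: the detector only maintains a set of suspected nodes, and the routing logic decides what to do with that information.

```python
class FailureDetector:
    """Tracks which nodes are currently suspected of having crashed.

    Hypothetical sketch: a real detector would update the suspect set
    from heartbeats or leases; here it is updated externally.
    """

    def __init__(self):
        self.suspected = set()

    def suspect(self, node):
        self.suspected.add(node)

    def restore(self, node):
        # A suspected node proved it is alive after all.
        self.suspected.discard(node)

    def is_suspected(self, node):
        return node in self.suspected


def route_request(detector, primary, replicas):
    """Pick a target for a client request, preferring the primary.

    The detector doesn't guarantee success - it only lets the
    coordinator skip nodes it believes have crashed. Returns None
    when every candidate is suspected.
    """
    for node in [primary] + replicas:
        if not detector.is_suspected(node):
            return node
    return None
```

Note that the routing function treats the detector’s output as advice, not truth: if every replica is suspected, the request simply cannot be served, which mirrors the point above that failure detectors don’t guarantee request success.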

Applications of failure detectors

Many distributed algorithms rely on failure detectors. Even though failure detectors can be implemented as independent components and used for things like reliable request routing, they are widely used internally for solving agreement problems: consensus, leader election, atomic broadcast, group membership, and other distributed algorithms.

Failure detectors are important for consensus: they improve reliability by helping distinguish nodes that are merely slow to respond from those that have crashed. Consensus algorithms can benefit from failure detectors that estimate which nodes have crashed, even when the estimate isn’t a hundred percent correct.

Failure detectors can improve atomic broadcast, the algorithm that makes sure messages are processed in the same order by every node in a distributed system. They can also be used to implement group membership algorithms and k-set agreement in asynchronous dynamic networks.

Failure Suspicions

Because of the differences in the nature of the environments our systems run in, there are several types of failure detectors we can use. In a synchronous system, a node can always determine whether another node is up or down, because there’s no nondeterministic delay in message processing. In an asynchronous system, we can’t immediately conclude that a node is down just because we didn’t hear from it. What we can do is start suspecting it’s down. This gives the suspected node a chance to prove that it’s up and didn’t actually fail, but was just taking a bit longer than usual to respond. After we’ve given the suspected node enough chances to reappear, we can start permanently suspecting it, concluding that the target node is down.
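The suspicion mechanism above, combined with the adaptive timeout from the key takeaways, can be sketched as a heartbeat-based detector. This is a simplified illustration with assumed names (`HeartbeatDetector`, `backoff`), not any specific system’s implementation: a node is suspected when its heartbeat is overdue, and each false suspicion makes the detector more patient, so repeated false suspicions of a slow-but-alive node eventually stop.

```python
import time


class HeartbeatDetector:
    """Suspects a node when no heartbeat arrives within `timeout` seconds.

    If a suspected node later sends a heartbeat, the suspicion was false:
    the node was alive, just slow. The detector then multiplies its
    timeout by `backoff`, the adaptive scheme mentioned above for
    approaching weak accuracy. Hypothetical sketch, not a real system's API.
    """

    def __init__(self, timeout=1.0, backoff=2.0, clock=time.monotonic):
        self.timeout = timeout
        self.backoff = backoff        # growth factor after a false suspicion
        self.clock = clock            # injectable for testing
        self.last_heartbeat = {}      # node -> time of its last heartbeat
        self.suspected = set()

    def heartbeat(self, node):
        if node in self.suspected:
            # False suspicion: the node reappeared. Forgive it and
            # increase the timeout so we don't suspect it as quickly again.
            self.suspected.discard(node)
            self.timeout *= self.backoff
        self.last_heartbeat[node] = self.clock()

    def check(self):
        """Mark overdue nodes as suspected; return the current suspect set."""
        now = self.clock()
        for node, last in self.last_heartbeat.items():
            if node not in self.suspected and now - last > self.timeout:
                self.suspected.add(node)
        return set(self.suspected)
```

A node that keeps triggering false suspicions causes the timeout to grow geometrically, so eventually the timeout exceeds that node’s actual response delay and the false suspicions stop, which is exactly the behavior needed for eventual (weak) accuracy.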
