Ruthie Bugala

Can We Trust the Cloud Not to Fail?

  • It is crucial for technical decision-makers to understand the specifics of what’s at the core of systems and how they work to provide the promised guarantees.
  • Failure detectors are one of the essential techniques to discover failures in a distributed system. There are different types of failure detectors offering different guarantees, depending on their completeness and accuracy.
  • In practical real-world systems, many failure detectors are implemented using heartbeats and timeouts.
  • To achieve at least weak accuracy, the timeout value should be chosen so that a node isn’t falsely suspected. One way to do this is to adapt the timeout, increasing its value after each false suspicion.
  • Service Fabric is one example of a system that implements failure detection: it uses a lease mechanism, similar to heartbeats. Azure Cosmos DB relies on Service Fabric.

In short, we can’t trust the Cloud to never fail. There are always some underlying components that will fail, restart, or go offline. On the other hand, will it matter if something goes wrong, and all the workloads are still running successfully? Very likely we’ll be okay with it.

We are used to talking about reliability at a high level, in terms of uptime figures that back guarantees for availability or fault tolerance. This is enough for most decision makers when choosing a technology. But how does the cloud actually provide this availability? I will explore this question in detail in this article.

Let’s get to the very core of it. What causes unavailability? Failures - machine failures, software failures, network failures - the reality of distributed systems. Them, and our inability to handle them. So how do we detect and handle failures? Decades of research and experiments shaped the way we approach them in modern cloud systems.

I will start with the theory behind failure detection, and then review a couple of real-world examples of how the mechanism works in a real cloud - on Azure. Even though this article includes real-world applications of failure detection within Azure, the same notions could also apply to GCP, AWS, or any other distributed system.

Why should you care?

This is interesting, but why should you care? Customers don’t care how exactly things are implemented under the hood; all they want is for their systems to be up and running. The industry is indeed moving towards creating abstractions that make it much easier for engineers to work with technologies, so they can focus on what needs to be done to solve business problems. As Corey Quinn wrote in his recent article:

“I care about price, and I care about performance. But if you’re checking both of those boxes, then I don’t particularly care whether you’ve replaced the processors themselves with very fast elves with backgrounds in accounting.”

This is true for the vast majority of end users.

For technical engineering leaders and decision makers, it can be crucial to understand the specifics of what’s at the core of the system and how it works to provide the promised guarantees. Transparency around the internals gives better insight into how such systems are likely to develop, which is valuable for long-term decisions and alignment. I gave a keynote talk at O’Reilly Velocity about why this is true, if you are curious to learn more, or you can read a summary here.

Theoretical Tale of Failure Detectors

Unreliable Failure Detectors For Reliable Distributed Systems

The paper by Chandra and Toueg was groundbreaking for distributed systems research and remains a useful source of information on the topic. I highly recommend reading it.

Failure detectors

Failure detectors are one of the essential techniques for discovering node crashes or failures in a cluster in a distributed system. They help processes in a distributed system change their action plan when they face such failures.

For example, failure detectors can help a coordinator node to avoid the unsuccessful attempt of requesting data from a node that crashed. With failure detectors, each node can know if any other nodes in the cluster crashed. Having this information, each node has the power to decide what to do in case of the detected failures of other nodes. For example, instead of contacting the main data node, the coordinator node could decide to contact one of the healthy secondary replicas of the main data node and provide the response to the client.

Failure detectors don’t guarantee the successful execution of client requests. They help nodes in the system stay aware of known crashes of other nodes and avoid continuing down a path of failure. Failure detectors collect and provide information about node failures; it’s up to the distributed system logic to decide how to use it. If the data is stored redundantly across several nodes, the coordinator can choose to contact alternative nodes to execute the request. In other cases, failures might affect enough replicas that the client request isn’t guaranteed to succeed.
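
To make the coordinator example concrete, here is a minimal sketch of how routing logic might consume a failure detector’s output. The function name `pick_replica`, the node names, and the `is_suspected` callback are all hypothetical and chosen for illustration; they are not taken from any real system.

```python
from typing import Callable, Iterable, Optional


def pick_replica(
    primary: str,
    secondaries: Iterable[str],
    is_suspected: Callable[[str], bool],
) -> Optional[str]:
    """Return the first node the failure detector does not suspect, or None."""
    for node in [primary, *secondaries]:
        if not is_suspected(node):
            return node
    return None  # every replica is suspected, so the request may fail


# Hypothetical detector output: the set of nodes currently suspected as crashed.
suspected = {"data-1"}

target = pick_replica("data-1", ["data-2", "data-3"], lambda n: n in suspected)
print(target)  # data-2
```

Note that the detector only supplies information. The decision to fall back to a secondary, or to give up when `None` is returned, stays with the system logic.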

Applications of failure detectors

Many distributed algorithms rely on failure detectors. Even though failure detectors can be implemented as an independent component and used for things like reliable request routing, failure detectors are widely used internally for solving agreement problems, consensus, leader election, atomic broadcast, group membership, and other distributed algorithms.

Failure detectors are important for consensus and can be applied to improve reliability and help distinguish between nodes that have delays in their responses and those that crashed. Consensus algorithms can benefit from using failure detectors that estimate which nodes have crashed, even when the estimate isn’t a hundred percent correct.

Failure detectors can improve atomic broadcast, the algorithm that makes sure messages are processed in the same order by every node in a distributed system. They can also be used to implement group membership algorithms and detectors, and to solve k-set agreement in asynchronous dynamic networks.

Failure Suspicions

Because the environments our systems run in differ in nature, there are several types of failure detectors we can use. In a synchronous system, a node can always determine if another node is up or down because there’s no nondeterministic delay in message processing. In an asynchronous system, we can’t immediately conclude that a certain node is down just because we haven’t heard from it. What we can do is start suspecting it’s down. This gives the suspected node a chance to prove that it’s up and didn’t actually fail, but is just taking a bit longer than usual to respond. After we have given the suspected node enough chances to reappear, we can start permanently suspecting it, concluding that the target node is down.
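
The suspicion process can be sketched as a small heartbeat-and-timeout detector. This is an illustrative sketch under stated assumptions, not how any particular cloud implements it: the class name `HeartbeatDetector`, the initial timeout of 1 second, the doubling factor, and the limit of three missed checks are all hypothetical values. It also increases the timeout after each false suspicion, as mentioned in the takeaways at the top. Time is passed in explicitly so the behaviour is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class HeartbeatDetector:
    """Timeout-based failure detector driven by explicit clock values (seconds)."""

    timeout: float = 1.0   # initial per-node timeout (assumed value)
    backoff: float = 2.0   # timeout multiplier after a false suspicion (assumed)
    max_strikes: int = 3   # failed checks before a suspect is declared down (assumed)
    last_seen: Dict[str, float] = field(default_factory=dict)
    timeouts: Dict[str, float] = field(default_factory=dict)
    strikes: Dict[str, int] = field(default_factory=dict)
    permanent: Set[str] = field(default_factory=set)

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat; one from a suspect shows the suspicion was false."""
        if node in self.permanent:
            return  # already declared down; rejoining is out of scope here
        if self.strikes.get(node, 0) > 0:
            # False suspicion: be more patient with this node from now on.
            self.timeouts[node] = self.timeouts.get(node, self.timeout) * self.backoff
        self.strikes[node] = 0
        self.last_seen[node] = now
        self.timeouts.setdefault(node, self.timeout)

    def check(self, node: str, now: float) -> str:
        """Return 'alive', 'suspected', or 'down' for a node that has sent
        at least one heartbeat."""
        if node in self.permanent:
            return "down"
        if now - self.last_seen[node] <= self.timeouts[node]:
            return "alive"
        self.strikes[node] += 1
        if self.strikes[node] >= self.max_strikes:
            self.permanent.add(node)  # enough chances given: permanent suspicion
            return "down"
        return "suspected"


d = HeartbeatDetector()
d.heartbeat("n1", now=0.0)
print(d.check("n1", now=0.5))  # alive
print(d.check("n1", now=1.5))  # suspected
d.heartbeat("n1", now=1.6)     # false suspicion: timeout for n1 doubles to 2.0
print(d.check("n1", now=3.0))  # alive
```

A production detector would also need to handle clock issues, message loss, and nodes rejoining after being declared down, none of which this sketch attempts.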

#cloud #failure #cosmos #azure #devops

Adaline Kulas

Multi-cloud Spending: 8 Tips To Lower Cost

A multi-cloud approach means leveraging two or more cloud platforms to meet the various business requirements of an enterprise. The multi-cloud IT environment incorporates different clouds from multiple vendors and removes the dependence on a single public cloud service provider. Enterprises can thus choose specific services from multiple public clouds and reap the benefits of each.

Given its affordability and agility, most enterprises now opt for a multi-cloud approach in cloud computing. A 2018 survey on the public cloud services market points out that 81% of the respondents use services from two or more providers. The cloud computing services market has likewise reported incredible growth in recent times. According to IDC, the worldwide public cloud services market is set to reach $500 billion in the next four years.

By choosing multi-cloud solutions strategically, enterprises can optimize the benefits of cloud computing and aim for some key competitive advantages. They can avoid the lengthy and cumbersome processes involved in buying, installing and testing high-priced systems. IaaS and PaaS solutions have become a windfall for the enterprise’s budget, as they do not incur huge up-front capital expenditure.

However, cost optimization is still a challenge in a multi-cloud environment, and a large number of enterprises end up overpaying, with or without realizing it. The tips below will help you ensure that money is spent wisely on cloud computing services.

  • Deactivate underused or unattached resources

Most organizations get simple things wrong, and these turn out to be the root cause of needless spending and resource wastage. The first step to cost optimization in your cloud strategy is to identify underutilized resources that you have been paying for.

Enterprises often continue to pay for resources that were purchased earlier but are no longer useful. Identifying such unused and unattached resources and deactivating them on a regular basis brings you one step closer to cost optimization. If needed, you can deploy automated cloud management tools that provide the analytics needed to optimize cloud spending and cut costs on an ongoing basis.

  • Figure out idle instances

Another key cost optimization strategy is to identify idle computing instances and consolidate them into fewer instances. An idle computing instance may run at a CPU utilization level of only 1-5%, but the service provider may still bill you for 100% of that instance.

Every enterprise will have non-production instances like these that take up unnecessary storage space and lead to overpaying. Re-evaluating your resource allocations regularly and removing unnecessary storage may help you save a significant amount of money. Resource allocation is not only a matter of CPU and memory; it is also linked to storage, network, and various other factors.
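
As a rough illustration of the idle-instance check described above, the sketch below flags instances whose average CPU utilization stays at or below 5%. The function name `find_idle_instances`, the instance names, and the sample values are hypothetical; real numbers would come from your provider’s monitoring data.

```python
from statistics import mean
from typing import Dict, List


def find_idle_instances(
    cpu_samples: Dict[str, List[float]],
    threshold_percent: float = 5.0,
) -> List[str]:
    """Return names of instances whose average CPU utilization is at or
    below the threshold, sorted alphabetically."""
    return sorted(
        name
        for name, samples in cpu_samples.items()
        if samples and mean(samples) <= threshold_percent
    )


# Hypothetical utilization samples, in percent.
samples = {
    "build-agent-1": [2.0, 3.5, 1.0],
    "web-frontend": [45.0, 60.0, 52.5],
    "staging-db": [4.0, 5.0, 4.5],
}
print(find_idle_instances(samples))  # ['build-agent-1', 'staging-db']
```

Instances flagged this way are candidates for consolidation or shutdown, not automatic removal, since CPU is only one of the factors mentioned above.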

  • Deploy monitoring mechanisms

The key to efficient cost reduction in cloud computing technology lies in proactive monitoring. A comprehensive view of the cloud usage helps enterprises to monitor and minimize unnecessary spending. You can make use of various mechanisms for monitoring computing demand.

For instance, you can use a heatmap to visualize the highs and lows in computing demand. The heatmap indicates when instances can be started and stopped, which in turn leads to reduced costs. You can also deploy automated tools that help organizations schedule instances to start and stop. By following a heatmap, you can understand whether it is safe to shut down servers on holidays or weekends.

#cloud computing services #all #hybrid cloud #cloud #multi-cloud strategy #cloud spend #multi-cloud spending #multi cloud adoption #why multi cloud #multi cloud trends #multi cloud companies #multi cloud research #multi cloud market

Adaline Kulas

What are the benefits of cloud migration? Reasons you should migrate

Moving applications, databases and other business elements from a local server to a cloud server is called cloud migration. This article deals with migration techniques, requirements and the benefits of cloud migration.

In simple terms, moving from a local server to a public cloud server is called cloud migration. According to Gartner, cloud migration promises 17.5% revenue growth; Gartner also has a forecast for 2022, as shown in the following image.

#cloud computing services #cloud migration #all #cloud #cloud migration strategy #enterprise cloud migration strategy #business benefits of cloud migration #key benefits of cloud migration #benefits of cloud migration #types of cloud migration

Google Cloud: Caching Cloud Storage content with Cloud CDN

In this Lab, we will configure Cloud Content Delivery Network (Cloud CDN) for a Cloud Storage bucket and verify caching of an image. Cloud CDN uses Google’s globally distributed edge points of presence to cache HTTP(S) load-balanced content close to our users. Caching content at the edges of Google’s network provides faster delivery of content to our users while reducing serving costs.

For an up-to-date list of Google’s Cloud CDN cache sites, see

Task 1. Create and populate a Cloud Storage bucket

Cloud CDN content can originate from different types of backends:

  • Compute Engine virtual machine (VM) instance groups
  • Zonal network endpoint groups (NEGs)
  • Internet network endpoint groups (NEGs), for endpoints that are outside of Google Cloud (also known as custom origins)
  • Google Cloud Storage buckets

In this lab, we will configure a Cloud Storage bucket as the backend.

#google-cloud #google-cloud-platform #cloud #cloud storage #cloud cdn

Zelma Gerlach

Cloud Operations Overview for Google Cloud Professional Architect

Operations Suite (Stackdriver) is a hybrid monitoring, logging, and diagnostics tool suite for applications on the Google Cloud Platform and AWS.

Google purchased Stackdriver and rebranded it as Google Stackdriver after the purchase.

Google has now rebranded the Stackdriver suite as “Cloud Operations”. This is important to know in case the exam has not been updated to reflect the change.

Cloud Operations monitors the cloud’s service layers in a single SaaS solution. Cloud Operations maintains native integration with the Google Cloud data tools BigQuery, Cloud Pub/Sub, Cloud Storage, and Cloud Datalab, and out-of-the-box integration with all your other application components.

In a nutshell, Cloud Operations Suite allows you to monitor, troubleshoot, and improve application performance in your Google Cloud environment.

#google-cloud-platform #google-cloud #cloud-computing #cloud-architecture #cloud

Zelma Gerlach

Top 7 Google Cloud Security Capabilities to Implement in your GCP Cloud


Ever since the advent of Google Cloud, the number of services that address customer and business requirements has grown, no matter what the enterprise domain is.

Google has put effort into coming up with solutions and products that not only fit current user needs but also cater to future business needs.

That’s precisely why companies opt for Google Cloud Products as their prime cloud services for their business operations.

Nevertheless, another thing that is of much interest is the amount of “Security” baked into these Google products. There are certainly some significant considerations when deploying anything in the cloud.

#google-cloud #google-cloud-platform #cloud-computing #cloud-security #cloud