High Availability

High availability describes a system that is designed to be fault-tolerant and highly dependable, to operate continuously without manual intervention, and to have no single point of failure. Such systems are in high demand because they provide the availability and uptime needed to keep an infrastructure running without issue. The following characteristics define a High Availability system.

High Availability Clustering

High-availability server clusters (also known as HA clusters) are groups of servers that support applications or services that can be used reliably with a minimal amount of downtime. These clusters run specialized software that uses redundancy to achieve mission-critical levels of uptime, such as five nines (99.999%). Currently, approximately 60% of businesses require five nines or greater to provide vital services.

High availability software takes advantage of redundant software installed on multiple systems. It groups, or clusters, servers that work toward a common goal so that the service survives when components fail. Without this form of clustering, an application or website that crashes stays unavailable until the servers are repaired. HA clustering handles these situations by detecting the fault and quickly restarting the failed service or replacing it with a new process on another server, without human intervention. This is known as the "failover" model.
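The failover model can be sketched as a small controller. It checks the active server and promotes the next healthy one when the active server fails. This is a minimal illustration that assumes a hypothetical `is_healthy` probe for each server. Real HA software performs these checks over the network and can also restart a failed service in place.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Server:
    name: str
    is_healthy: Callable[[], bool]  # hypothetical probe; in practice a network check


@dataclass
class FailoverController:
    servers: List[Server]  # priority order: primary first, then standbys
    active: int = 0        # index of the server currently serving
    events: List[str] = field(default_factory=list)

    def check(self) -> str:
        """Run one health check and return the server that should now be serving."""
        current = self.servers[self.active]
        if current.is_healthy():
            return current.name
        # Promote the next healthy server, in priority order, without human intervention.
        for offset in range(1, len(self.servers)):
            idx = (self.active + offset) % len(self.servers)
            if self.servers[idx].is_healthy():
                self.active = idx
                self.events.append(f"failover: {current.name} -> {self.servers[idx].name}")
                return self.servers[idx].name
        raise RuntimeError("no healthy server available")


status = {"node1": True, "node2": True}
ctl = FailoverController([Server(n, lambda n=n: status[n]) for n in ("node1", "node2")])
print(ctl.check())  # node1
status["node1"] = False
print(ctl.check())  # node2 (failover happened)
```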

The illustration below shows a simple two-node high availability cluster.

[Figure: a simple two-node high availability cluster]

High Availability clusters are often used for mission-critical databases, data sharing, applications, and e-commerce websites spread over a network. High Availability implementations build redundancy into a cluster to eliminate any single point of failure. This includes network connections and data storage, which can be connected redundantly through geographically diverse storage area networks.

High Availability clustered servers usually rely on a monitoring technique called a heartbeat, which checks each node's status and health within the cluster over a private network connection. One critical situation that all clustering software must be able to handle is split-brain. Split-brain occurs when all private internal links go down at the same time while the nodes in the cluster continue to run. If this happens, each node may incorrectly conclude that all the other nodes have gone down and try to start services that those nodes are still running. Duplicate instances of the same service running at once can cause data corruption.
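Clustering software commonly guards against split-brain with a quorum rule. A node keeps running services only while it can see a strict majority of the cluster, so at most one side of a partition stays active. The sketch below combines heartbeat timeouts with this majority check. The class name and timeout value are assumptions, and production stacks add fencing on top. A two-node cluster cannot keep a majority after losing its link, which is why such clusters rely on fencing or an external tie-breaker.

```python
import time
from typing import Dict, Optional, Set


class HeartbeatMonitor:
    """Tracks peer heartbeats and decides whether this node may run services."""

    def __init__(self, node_id: str, cluster_size: int, timeout: float = 5.0):
        self.node_id = node_id
        self.cluster_size = cluster_size  # total nodes, including this one
        self.timeout = timeout            # seconds of silence before a peer is presumed down
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, peer: str, now: Optional[float] = None) -> None:
        """Call whenever a heartbeat message arrives from `peer`."""
        self.last_seen[peer] = time.monotonic() if now is None else now

    def alive_peers(self, now: Optional[float] = None) -> Set[str]:
        now = time.monotonic() if now is None else now
        return {p for p, t in self.last_seen.items() if now - t <= self.timeout}

    def has_quorum(self, now: Optional[float] = None) -> bool:
        """True only if this node plus its live peers form a strict majority."""
        visible = 1 + len(self.alive_peers(now))
        return visible > self.cluster_size // 2


mon = HeartbeatMonitor("node1", cluster_size=3, timeout=5.0)
mon.record_heartbeat("node2", now=100.0)
mon.record_heartbeat("node3", now=100.0)
print(mon.has_quorum(now=103.0))  # True: all three nodes visible
print(mon.has_quorum(now=110.0))  # False: both peers silent for 10 s, so stop services
```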


Typical high availability software provides both hardware and software redundancy. Its features include:

  • The automatic detection and discovery of hardware and software components.
  • Autonomous assignment of active and standby roles to new elements.
  • Detection of failed software services, hardware components, and other system constructs.
  • Monitoring of redundant components, with notification when they need to be activated.
  • Ability to scale the cluster to accommodate changing requirements without external intervention.
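Automatic role assignment can be illustrated with a simple policy. Healthy components fill the active slots in order, the remaining healthy ones become standbys, and failed ones are flagged for notification. The function name and ordering policy are hypothetical.

```python
from typing import Dict, List


def assign_roles(components: List[str], healthy: Dict[str, bool],
                 active_count: int = 1) -> Dict[str, str]:
    """Assign 'active', 'standby', or 'failed' to each component (hypothetical policy)."""
    roles: Dict[str, str] = {}
    actives = 0
    for name in components:
        if not healthy.get(name, False):
            roles[name] = "failed"    # detected failure: report it for notification
        elif actives < active_count:
            roles[name] = "active"    # fill the active slots first
            actives += 1
        else:
            roles[name] = "standby"   # contingent role, ready to be activated
    return roles


print(assign_roles(["db1", "db2", "db3"], {"db1": False, "db2": True, "db3": True}))
# {'db1': 'failed', 'db2': 'active', 'db3': 'standby'}
```

Re-running the function whenever a component joins or fails reassigns roles without manual intervention. Real cluster software also tends to prefer keeping a resource where it is already running, which Pacemaker calls resource stickiness, to avoid needless moves.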

Fault Tolerance


Fault tolerance is the ability of a system's infrastructure to anticipate and withstand errors, and to respond to them automatically when they occur. The defining quality of these systems is design work done in advance that can be called upon when a problem occurs. Configuring an infrastructure that anticipates every possible failure is a considerable task. It requires the knowledge and experience to counter multiple concerns before they arise. System architects who design such frameworks need methodologies for mitigating these problems in advance, as well as the ability to implement them.

The following redundancy methodologies are available and should be reviewed during the initial stages of design and implementation.

  • N + 1 Model – N is the number of components needed to keep the entire framework up and running. The N + 1 model adds one independent backup component that can take over if any one of the N components fails.
  • N + 2 Model – Similar to the N + 1 model, but with two backup components, so the system survives the failure of two components.
  • 2N Model – Every required component is fully duplicated. Each element has its own backup, so the system's framework remains fully functional even if an entire set of components fails.
  • 2N + 1 Model – The 2N model plus one additional component, which adds a further layer of protection to the system's framework.

As models progress from N + 1 toward 2N + 1, the cost rises sharply, because 2N designs roughly double the equipment required. For systems that truly require uptime, however, these models are critical for stability and availability.
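The cost difference between the models follows from the number of units each one requires. This sketch assumes N interchangeable units, such as power supplies or servers. It counts the total units, the spares, and the overhead relative to N for each model.

```python
from typing import Dict


def redundancy_plan(n: int) -> Dict[str, Dict[str, float]]:
    """Units required by each redundancy model for N required components."""
    if n < 1:
        raise ValueError("N must be at least 1")
    totals = {"N+1": n + 1, "N+2": n + 2, "2N": 2 * n, "2N+1": 2 * n + 1}
    return {
        model: {
            "total": total,
            "spares": total - n,                                  # units beyond what is needed
            "overhead_pct": round(100 * (total - n) / n, 1),      # extra equipment as % of N
        }
        for model, total in totals.items()
    }


for model, plan in redundancy_plan(4).items():
    print(model, plan)
```

For N = 4, this gives 5, 6, 8, and 9 total units. The overhead therefore climbs from 25% for N + 1 to 125% for 2N + 1.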

Dependability and Reliability

One of the central tenets of a high availability system is uptime. Uptime is of primary importance, especially when a system provides an essential service, such as the 911 systems that respond to emergencies. In business, a high availability system is required to keep a vital service online. One example is an ISP or another service that cannot tolerate a loss of function. These systems must be designed with high availability and fault tolerance to ensure reliability and availability while minimizing downtime.

Orchestrated Error Handling

Should an error occur, the system adapts and compensates for the issue while remaining up and online. Building this type of system requires forethought and planning for the unexpected. Anticipating problems and planning their resolution in advance is one of the main qualities of a high availability system.
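One common way a system compensates for errors while staying online is to retry transient failures with exponential backoff, then fall back to a redundant backend. This is a minimal sketch that assumes hypothetical zero-argument backend callables and treats `ConnectionError` as the only transient error. The `sleep` parameter is injectable so the delays can be observed or skipped.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_fallback(
    backends: Sequence[Callable[[], T]],
    retries: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each backend in order, retrying transient errors with exponential backoff."""
    last_error: Exception = RuntimeError("no backends configured")
    for backend in backends:
        for attempt in range(retries):
            try:
                return backend()
            except ConnectionError as exc:  # assumed to be the only transient error
                last_error = exc
                if attempt < retries - 1:
                    sleep(base_delay * 2 ** attempt)  # 0.1 s, 0.2 s, 0.4 s, ...
        # This backend kept failing; fall through to the next (redundant) one.
    raise last_error
```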

Scalability

Should the system encounter an issue such as a traffic spike or an increase in resource usage, it should scale to meet those needs automatically and immediately. Building these features into the system allows it to respond quickly to changes in the demands placed on the architecture's processes.
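Automatic scaling is often driven by a proportional rule that grows or shrinks the replica count until average utilization approaches a target. The sketch below uses the same basic formula as the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current * current utilization / target utilization). The replica bounds are assumptions. Real autoscalers also add tolerances and stabilization windows to avoid flapping.

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Proportional scaling rule, clamped to [min_replicas, max_replicas].

    Utilization values are percentages, e.g. 90 for 90% CPU.
    """
    if current < 1 or target_util <= 0:
        raise ValueError("need at least one replica and a positive target")
    desired = math.ceil(current * current_util / target_util)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90, 60))  # 6: a traffic spike adds two replicas
print(desired_replicas(6, 20, 60))  # 2: load dropped, so scale back in
```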

Availability & Five 9’s Uptime

Five nines (99.999%) is a common industry benchmark for uptime. The measurement can apply to the system itself, to the system processes within a framework, or to the program operating inside an infrastructure. It is often applied to a program delivered to clients as a website or web application. A system's availability is the percentage of time that the system is available, calculated with this equation: x = (n - y) * 100 / n. Here "n" is the total number of minutes in a calendar month, and "y" is the number of minutes the service is inaccessible during that month. The table below shows the yearly downtime that corresponds to each number of "9's".

| Availability % | Downtime per year |
|---|---|
| 90% ("1 Nine") | 36.53 days |
| 99% ("2 Nines") | 3.65 days |
| 99.9% ("3 Nines") | 8.77 hours |
| 99.99% ("4 Nines") | 52.60 minutes |
| 99.999% ("5 Nines") | 5.26 minutes |

As the table shows, each additional "9" reduces the permitted downtime by a factor of ten. A high availability system aims to minimize potential downtime so that it is always available to provide its designated services.
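The formula and the table can be reproduced with a few lines of code. This sketch assumes a 365.25-day year, which matches the table's downtime figures (for example, 0.001% of a year is about 5.26 minutes). The function names are illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year length, consistent with the table


def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """x = (n - y) * 100 / n: n = total minutes in the period, y = minutes of downtime."""
    return (total_minutes - downtime_minutes) * 100 / total_minutes


def downtime_per_year(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability percentage."""
    return MINUTES_PER_YEAR * (100 - availability) / 100


# A 30-day month has 43,200 minutes; 43 minutes of outage is roughly "three nines".
print(f"{availability_pct(43_200, 43):.3f}%")
for pct in (99.9, 99.99, 99.999):
    print(f"{pct}%: {downtime_per_year(pct):.2f} minutes/year")
```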

Heartbeat

One of the main High Availability components is Heartbeat. Heartbeat is a daemon that provides cluster messaging and membership services. It works with cluster management software such as Pacemaker, which is designed specifically for high-availability cluster resource management. The most important characteristics of this combined stack are:

  • No fixed maximum number of nodes – Heartbeat can be used to build large clusters as well as simple ones.
  • Resource monitoring: resources can be automatically restarted or moved to another node on failure.
  • A fencing mechanism to remove failed nodes from the cluster.
  • Sophisticated policy-based resource management, including resource inter-dependencies and constraints.
  • Time-based rules that allow different policies to apply during different timeframes.
  • A set of resource scripts (for software such as Apache, DB2, Oracle, and PostgreSQL) for more granular management.
  • A GUI for configuring, controlling and monitoring resources and nodes.
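Resource monitoring with automatic restart or relocation can be illustrated with a toy policy. A failed resource is restarted where it is, and it moves to another node once it has failed a set number of times. The `migration_threshold` name echoes Pacemaker's migration-threshold resource option, but this class is a hypothetical sketch and not Pacemaker's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ResourceManager:
    """Toy policy: restart a failed resource in place, then move it to another node."""
    nodes: List[str]
    migration_threshold: int = 3
    location: Dict[str, str] = field(default_factory=dict)
    failures: Dict[str, int] = field(default_factory=dict)

    def start(self, resource: str) -> str:
        self.location[resource] = self.nodes[0]
        self.failures[resource] = 0
        return self.nodes[0]

    def on_failure(self, resource: str) -> str:
        """Record a monitor failure and return the action taken."""
        self.failures[resource] += 1
        current = self.location[resource]
        if self.failures[resource] < self.migration_threshold:
            return f"restart {resource} on {current}"
        others = [n for n in self.nodes if n != current]
        if not others:
            return f"stop {resource}: no other node available"
        # Threshold reached: relocate and start counting failures afresh on the new node.
        self.location[resource] = others[0]
        self.failures[resource] = 0
        return f"move {resource} from {current} to {others[0]}"
```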

Cluster Architecture

**Engineered Availability**

The first component of a highly available architecture is a cluster of application servers designed in advance to distribute load across the whole cluster. This includes the ability to fail over to a secondary, and possibly a tertiary, system.
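Distributing load across the cluster while routing around unhealthy servers can be sketched with a health-aware round-robin balancer. Round-robin is just one assumed strategy here, and the class and server names are hypothetical.

```python
import itertools
from typing import Dict, List


class RoundRobinBalancer:
    """Send requests to healthy app servers in turn; skip unhealthy ones (failover)."""

    def __init__(self, servers: List[str]):
        self.servers = servers
        self.healthy: Dict[str, bool] = {s: True for s in servers}
        self._cycle = itertools.cycle(servers)

    def mark(self, server: str, ok: bool) -> None:
        """Update a server's health, e.g. from a periodic health check."""
        self.healthy[server] = ok

    def next_server(self) -> str:
        # Make at most one full pass over the ring before giving up.
        for _ in range(len(self.servers)):
            candidate = next(self._cycle)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy servers in the cluster")


lb = RoundRobinBalancer(["app1", "app2", "app3"])
lb.mark("app2", False)
print([lb.next_server() for _ in range(4)])  # ['app1', 'app3', 'app1', 'app3']
```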

The second component is database scalability. The database must be able to scale either horizontally or vertically, using multi-master replication and a load balancer to improve its stability and uptime.



What is High Availability? A Tutorial | Liquid Web