When you are building a software application or a service, I’m sure you’ve heard these big words: scalability, maintainability, and reliability. Everyone talks about them.


Everyone throws these words around without really knowing what the individual terms actually mean. Today we will tackle what reliability really means in depth, so you don’t freeze when an interviewer asks you something like: “So, how do you make a reliable application?”

What is Reliability?


We know one thing for sure. We know when a service is unreliable:

- Netflix not working when you want to watch a movie or a TV show.

- Uber not working correctly when you need to request a ride.


Unreliable services yield a bad user experience, and sometimes they even lead to angry users.

Here are some typical expectations of reliable services:

  • Performs the function that the user expects.
  • Tolerates the user making mistakes or using the software in unexpected ways.
  • Provides good performance (e.g., low latency) under the expected load and data volume.
  • Prevents any unauthorized access and abuse.

It basically means that the service or application should continue to work correctly even when things go wrong!
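The second point, tolerating user mistakes, usually comes down to validating input and failing gracefully instead of crashing. Here is a minimal sketch of that idea; the ride-request field names are made up for illustration:

```python
# A sketch of "tolerating user mistakes": validate input and respond with a
# helpful message instead of crashing. Field names are hypothetical.

def parse_ride_request(form: dict) -> dict:
    """Turn a raw form submission into a validated request, or explain what's wrong."""
    errors = []

    pickup = (form.get("pickup") or "").strip()
    if not pickup:
        errors.append("Please enter a pickup location.")

    try:
        passengers = int(form.get("passengers", "1"))
    except (TypeError, ValueError):
        passengers = None
        errors.append("Passengers must be a whole number.")
    if passengers is not None and not 1 <= passengers <= 6:
        errors.append("Passengers must be between 1 and 6.")

    if errors:
        # The user made a mistake; the service keeps working and says why.
        return {"ok": False, "errors": errors}
    return {"ok": True, "pickup": pickup, "passengers": passengers}

print(parse_ride_request({"pickup": "", "passengers": "abc"}))
# {'ok': False, 'errors': ['Please enter a pickup location.', 'Passengers must be a whole number.']}
```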

What Can Go Wrong?

The things that can go wrong are called faults, and there are three main types of faults:

  • Hardware Faults
  • Software Faults
  • Human Faults

Hardware Faults

We usually think of hardware faults as being random and independent of each other: one machine’s disk failing does not imply that another machine’s disk is going to fail.

There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.

Here are different types of hardware faults:

  • RAM can become faulty
  • Someone unplugs the wrong network cable
  • Power grid blackout

Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 100,000 disks, we should expect on average about 10 disks to die per day.
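As a rough back-of-the-envelope check (assuming failures are independent and spread evenly over the MTTF), you can estimate the expected number of daily failures like this:

```python
# Back-of-the-envelope estimate of daily disk failures in a large cluster.
# Assumes failures are independent and evenly spread over the MTTF.

DAYS_PER_YEAR = 365

def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected number of disks that die per day, given fleet size and MTTF."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)

# 100,000 disks with an MTTF somewhere between 10 and 50 years:
print(expected_failures_per_day(100_000, 50))  # ~5.5 failures/day (optimistic)
print(expected_failures_per_day(100_000, 10))  # ~27 failures/day (pessimistic)
# With an MTTF of around 25-30 years, that works out to roughly 10 disks per day.
```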


Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. For instance, you can have:

  • RAID configuration for your disks
  • Batteries and diesel generators for backup power
  • Dual power supplies
  • Hot-swappable CPUs

This approach cannot completely prevent hardware faults from causing failures, but it definitely reduces the probability of failure. However, as data volumes have grown, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Hence, the industry has started using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy.
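To make “software fault tolerance” concrete, here is a minimal, hypothetical sketch: instead of assuming a single machine stays up, the client retries the same request against other replicas when one node fails. The replica addresses and the fetch_from_replica helper are made up for illustration.

```python
# A sketch of one software fault-tolerance technique: tolerate the loss of a
# single machine by retrying the request against other replicas.
# Replica addresses and the fetch function are hypothetical placeholders.

import random

REPLICAS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical replica addresses

class ReplicaError(Exception):
    """Raised when a single replica fails to answer."""

def fetch_from_replica(address: str, key: str) -> str:
    """Placeholder for a real network call; here it just simulates random failures."""
    if random.random() < 0.3:  # pretend ~30% of calls hit a dead or slow node
        raise ReplicaError(f"replica {address} is unreachable")
    return f"value-for-{key}-from-{address}"

def fetch_with_failover(key: str) -> str:
    """Try replicas in random order so one failed machine doesn't fail the request."""
    for address in random.sample(REPLICAS, len(REPLICAS)):
        try:
            return fetch_from_replica(address, key)
        except ReplicaError:
            continue  # this node is down; try the next replica
    raise RuntimeError("all replicas failed")

print(fetch_with_failover("user:42"))
```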

Software Faults

Software faults are generally harder to anticipate, and they can cause broader damage to the overall system or application because they are correlated across nodes.

Here are different types of software faults:

  • A software bug that causes every instance of an application server to crash when given a particular bad input
  • Cascading failures, where a small fault in one component triggers a fault in another component.
  • A runaway process that uses up a shared resource (e.g., CPU time, memory, disk space, or network bandwidth)
  • A service that the system depends on that slows down, becomes unresponsive, or starts returning corrupted responses.

Most of these bugs are created when programmers make some kind of assumption about the software environment, and that assumption suddenly stops being true for some reason.
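For instance, the last fault in the list above (a dependency that slows down or stops responding) is often handled defensively in the calling code. Below is a minimal, hypothetical sketch using the requests HTTP library; the service URL and response shape are assumptions for illustration only.

```python
# Guarding against a slow or unresponsive dependency: bound the wait with a
# timeout and degrade gracefully instead of hanging or returning an error page.

import requests  # third-party HTTP client: pip install requests

RECOMMENDATIONS_URL = "http://recommendations.internal/api/v1/top"  # hypothetical

def get_recommendations(user_id: str) -> list:
    """Call a downstream service, but never wait on it indefinitely."""
    try:
        response = requests.get(
            RECOMMENDATIONS_URL,
            params={"user": user_id},
            timeout=2.0,  # fail fast when the dependency is slow or unresponsive
        )
        response.raise_for_status()
        return response.json()  # assumed to be a JSON list of item IDs
    except requests.RequestException:
        return []  # degrade gracefully: show no recommendations instead of failing
```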

