When you build a software application or a service, I’m sure you’ve heard these big words: scalability, maintainability, and reliability. Everyone talks about them.
People throw these words at each other without really knowing what each term actually means. Today we will look in depth at what reliability really means, so you don’t freeze when an interviewer asks you something like: “How do you make a reliable application?”
We know one thing for sure. We know when a service is unreliable:
- Netflix not working when you want to watch a movie or a TV show
- Uber not working correctly when you need to request a ride
Unreliable services yield a bad user experience, and sometimes they lead to mad users as well.
Here are some typical expectations of reliable services:
In short, reliability means that a service or application should continue to work correctly even when things go wrong!
The things that can go wrong are called faults, and there are three main types of faults.
We usually think of hardware faults as being random and independent from each other: one machine’s disk failing does not imply that another machine’s disk is going to fail.
There may be weak correlations (for example due to a common cause, such as the temperature in the server rack), but otherwise it is unlikely that a large number of hardware components will fail at the same time.
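To see why independence matters, here is a minimal sketch of the arithmetic. The per-component failure probability `p` is an illustrative assumption, not a measured value, and the function name is made up for this example. If failures really are independent, the chance that several components fail in the same window is the product of their individual probabilities.

```python
def prob_all_fail(p_single: float, copies: int) -> float:
    """Probability that all `copies` components fail in the same window,
    assuming each fails independently with probability `p_single`."""
    return p_single ** copies


p = 0.01  # assumed chance that one component fails in a given window (illustrative)
for copies in (1, 2, 3):
    print(f"{copies} component(s): {prob_all_fail(p, copies):.6f}")
```

The numbers shrink quickly only because of the independence assumption. A common cause, such as an overheated rack, breaks that assumption and makes simultaneous failures more likely than this calculation suggests.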
Here are different types of hardware faults:
Hard disks are reported as having a mean time to failure (MTTF) of about 10 to 50 years. Thus, on a storage cluster with 100,000 disks, we should expect on the order of 10 disks to die per day on average.
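The disk estimate can be checked with a quick back-of-the-envelope calculation. This sketch assumes a constant failure rate, so the expected number of failures per day is simply the number of disks divided by the MTTF in days. The function name is hypothetical.

```python
def expected_daily_failures(num_disks: int, mttf_years: float) -> float:
    """Average number of disk failures per day, assuming a constant failure rate."""
    return num_disks / (mttf_years * 365)


for mttf_years in (10, 50):
    failures = expected_daily_failures(100_000, mttf_years)
    print(f"MTTF {mttf_years} years: {failures:.1f} failures/day")
# prints:
# MTTF 10 years: 27.4 failures/day
# MTTF 50 years: 5.5 failures/day
```

So for 100,000 disks the expected rate falls somewhere between about 5 and 27 failures per day, depending on where in the 10 to 50 year range the real MTTF sits.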
Our first response is usually to add redundancy to the individual hardware components in order to reduce the failure rate of the system. For instance, you can have:
This approach cannot completely prevent hardware faults from causing failures, but it definitely reduces the probability of failure. However, as data volumes have grown significantly, more applications have begun using larger numbers of machines, which proportionally increases the rate of hardware faults. Hence, the industry started using software fault-tolerance techniques in preference to, or in addition to, hardware redundancy.
Software faults are generally harder to anticipate, and they can cause broader damage to the overall system or application because they are correlated across nodes.
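One common form of software fault tolerance is failover: if one machine cannot answer, ask another. The sketch below is only an illustration under simple assumptions. The replicas are plain Python callables, and the names `read_with_failover`, `broken_replica`, and `healthy_replica` are hypothetical, not part of any real library.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every replica raised an error."""


def read_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed: {errors}")


def broken_replica() -> str:
    # Simulates a machine with a hardware fault.
    raise ConnectionError("disk unavailable")


def healthy_replica() -> str:
    return "value-from-replica-2"


print(read_with_failover([broken_replica, healthy_replica]))
# prints: value-from-replica-2
```

The caller never sees the fault on the first machine. The fault only becomes a failure when every replica is down at once.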
Here are different types of software faults:
Most of these bugs arise when programmers make some kind of assumption about the software environment, and that assumption suddenly stops being true for some reason.
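As one illustration of such an assumption (chosen for this example, not taken from the text above): code that measures durations with the wall clock silently assumes the system clock never jumps backward. Python's `time.monotonic()` removes that assumption because it is guaranteed not to go backward. The helper name `measure_elapsed` is hypothetical.

```python
import time
from typing import Callable


def measure_elapsed(work: Callable[[], None]) -> float:
    """Return how long `work` took in seconds.

    Uses time.monotonic() rather than time.time(), because the wall clock
    can be adjusted while the program runs and the monotonic clock cannot
    go backward.
    """
    start = time.monotonic()
    work()
    return time.monotonic() - start


elapsed = measure_elapsed(lambda: sum(range(100_000)))
print(f"work took {elapsed:.6f} seconds")
```

The change is small, but it replaces a hidden assumption about the environment with a guarantee from the standard library.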