Service Reliability

Reliability of services is crucial in most applications, and everyone should aim for their services to be reliable. However, reliability can be very expensive, so we should know how to manage it properly.

Creating an overly reliable system is costly both in the money needed to achieve it and in the time it takes away from other important aspects of a product (e.g. new features).

Since many factors play into the reliability of a system, it's almost impossible to measure them all, so we can focus on unplanned downtime as the most encompassing risk our services face. With this in mind, the two main parameters we need to know in order to manage service reliability are how available our services are (Measuring Service Availability) and what our Service Availability Target is.
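
To make these two parameters concrete, here is a minimal Python sketch. It assumes availability is measured as the share of uptime in an observation window and that the target is expressed as a fraction of that window. The function names, the 30-day window, and the sample numbers are hypothetical and only serve as illustration.

```python
def availability(uptime_s: float, downtime_s: float) -> float:
    """Measured availability: uptime as a fraction of the whole window."""
    total_s = uptime_s + downtime_s
    if total_s <= 0:
        raise ValueError("observation window must be longer than zero")
    return uptime_s / total_s


def allowed_downtime_s(target: float, window_s: float) -> float:
    """Downtime (in seconds) that a given availability target tolerates."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1.0 - target) * window_s


if __name__ == "__main__":
    window_s = 30 * 24 * 60 * 60   # hypothetical 30-day window
    downtime_s = 30 * 60           # hypothetical: 30 minutes of unplanned downtime
    measured = availability(window_s - downtime_s, downtime_s)
    print(f"measured availability: {measured:.5f}")
    print(f"downtime allowed at a 99.9% target: {allowed_downtime_s(0.999, window_s):.0f} s")
```

Comparing the measured value against the target shows at a glance whether the service is more or less reliable than it needs to be.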

Since we never want (except in some extreme cases) a 100% available system, we can manage the expected unreliability by using Error Budgets.
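
As a rough illustration of the idea, the sketch below treats the error budget as the downtime that the availability target leaves over (1 minus the target, multiplied by the window) and tracks how much of it is still unspent. The class name, its fields, and the "can we take more risk" rule are assumptions made for this example, not a prescribed policy.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    target: float    # availability target as a fraction, e.g. 0.999
    window_s: float  # length of the measurement window in seconds

    @property
    def total_s(self) -> float:
        """Total unreliability (seconds of downtime) the target allows."""
        return (1.0 - self.target) * self.window_s

    def remaining_s(self, downtime_s: float) -> float:
        """Budget left after the downtime already incurred in this window."""
        return self.total_s - downtime_s

    def can_take_risk(self, downtime_s: float) -> bool:
        """Assumed rule: risky changes are acceptable while budget remains."""
        return self.remaining_s(downtime_s) > 0


if __name__ == "__main__":
    budget = ErrorBudget(target=0.999, window_s=30 * 24 * 60 * 60)
    spent_s = 20 * 60  # hypothetical: 20 minutes of downtime so far
    print(f"budget: {budget.total_s:.0f} s, remaining: {budget.remaining_s(spent_s):.0f} s")
    print("room for risk:", budget.can_take_risk(spent_s))
```

A shared number like this gives product and ops teams a common basis for deciding when to ship and when to slow down.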

Status: #💡

References:

Book - Site Reliability Engineering (Source)
