SLI

Service Level Indicators are quantitative measures of the level of service provided, often aggregated into rates, averages, or percentiles.

Common SLIs are availability, error rate, latency, throughput, and data durability.
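As a rough illustration of how raw request data gets aggregated into a rate-style SLI, here is a minimal sketch. The `Request` record, the sample data, and the choice to count only 5xx responses as failures are all assumptions made for the example, not something prescribed by the SRE book.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; the field names are illustrative."""
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request


def error_rate(requests):
    """Fraction of requests that ended in a server error (5xx)."""
    if not requests:
        return 0.0  # assumption: no traffic counts as no errors
    failed = sum(1 for r in requests if r.status >= 500)
    return failed / len(requests)


def availability(requests):
    """Fraction of requests that were served without a server error."""
    return 1.0 - error_rate(requests)


# Made-up sample: 97 successful requests and 3 server errors.
requests = [Request(200, 42.0)] * 97 + [Request(500, 310.0)] * 3

print(f"error rate:   {error_rate(requests):.2%}")
print(f"availability: {availability(requests):.2%}")
```

Whether client errors (4xx) or slow-but-successful requests should count against availability is a per-service decision; the sketch only shows the shape of the calculation.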

We have a lot of metrics we are monitoring in our systems, but not all of them should be defined as SLIs. Good SLIs are metrics that our users care about. When we don't have the metrics our users care about the most (e.g. frontend latency), we should find the next best thing that we have available (e.g. backend latency).

If in some cases server-side metrics (e.g. gathered via Prometheus or log processing) are not good enough, measuring frontend metrics (e.g. time to a full page load) can give us much more information and cover some potential blind spots (e.g. slow JavaScript on the frontend).

Defining too many SLIs is bound to cause problems, as you can't possibly pay attention to everything all the time. On the other hand, defining too few SLIs can leave big gaps in what you know about the behavior of your system.

When deciding which indicators to choose, it's better to start from the desired SLOs and work your way back to indicators. See Defining Service Level Objectives.

When aggregating metrics for SLIs, it's much better to use percentiles than the mean, as the mean can hide many things we'd be interested in: if most requests complete within 50ms, but 5% of requests are 20x slower, it'd be hard to notice on a chart of the mean.
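The 50ms example can be worked through in a few lines. This is a minimal sketch on made-up data (95 requests at 50ms and 5 requests 20x slower); the `percentile` helper is a hypothetical nearest-rank implementation written for the example, not the method any particular monitoring system uses.

```python
import math
import statistics

# Made-up sample: 95 requests at 50 ms, 5 requests 20x slower (1000 ms).
latencies_ms = [50.0] * 95 + [1000.0] * 5


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


print("mean:", statistics.mean(latencies_ms))  # 97.5
print("p50: ", percentile(latencies_ms, 50))   # 50.0
print("p95: ", percentile(latencies_ms, 95))   # 50.0
print("p99: ", percentile(latencies_ms, 99))   # 1000.0
```

The mean of 97.5ms does not match any real request: it is neither the typical 50ms nor the 1000ms tail. The p50 shows what most users get, and the p99 exposes the slow requests directly. Note that with exactly 5% slow requests the p95 still sits at 50ms here, which is one reason to look at more than one percentile.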

Status: #💡

References:

Book - Site Reliability Engineering (Source)

Links to this note

Measuring Service Availability

SRE

Defining Service Level Objectives

SLO

Symptom Based Monitoring

What should i be Alerting on