White Box Monitoring
[[Monitoring]] is an integral part of running services in production. Without it, we are blind to what's going on, and thus unable to act in our best interest. Providing visibility is in the...
White Box Monitoring is when we monitor the internal workings of our system. For example, users have no idea about our current CPU utilization, so that metric is a White Box metric. Its primary use is in collecting telemetry and debugging, but it also has an important role in Alerting.
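As a rough illustration of what collecting White Box telemetry can look like, here is a minimal Python sketch that reads two internal metrics from the host. The function name `collect_whitebox_metrics` and the metric names are hypothetical, and the load average is only an assumed stand-in for CPU utilization (it is available on Unix-like systems only).

```python
import os
import shutil


def collect_whitebox_metrics(path="/"):
    """Return a few internal metrics that users never see directly."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    metrics = {
        "disk_free_bytes": usage.free,
        "disk_used_ratio": usage.used / usage.total if usage.total else 0.0,
    }
    try:
        # 1-minute load average per CPU, a rough proxy for CPU pressure.
        load_1m, _, _ = os.getloadavg()
        metrics["load_per_cpu_1m"] = load_1m / (os.cpu_count() or 1)
    except (AttributeError, OSError):
        pass  # load average is not available on this platform
    return metrics


if __name__ == "__main__":
    print(collect_whitebox_metrics())
```

In a real setup these values would usually be exported to a metrics system rather than printed.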
White Box Monitoring is useful for setting up strong [[Cause Based Monitoring]], but it can also be used for [[Symptom Based Monitoring]] in some cases, depending on who is looking at it and what data it shows.
Cause Based [[Monitoring]] points us to the cause of an existing issue, but doesn't imply that the issue exists in the first place. Some examples of Cause Metrics are:
- CPU utilization
- Free disk space
- ...
Symptom Based [[Monitoring]] allows us to observe the user experience. A metric is Symptom based if it shows an actual symptom that is making our users happy or sad. We gather Symptom ...
When we look at internal metrics which are not directly visible to the user, we can gain insight into issues that are yet to come, and perform Alerting on them. For example, a user doesn't see that a disk is getting fuller, but they will feel it once it's completely full. When thinking about alerting on imminent issues, [[Black Box Monitoring]] is a much better fit.
Black Box Monitoring is when we look at our system from the perspective of our users, without knowing anything about its internal state. Since Black Box Monitoring is looking at customer experien...
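The disk example above, where an internal metric warns about a problem users cannot see yet, can be sketched as a simple extrapolation. This is only a minimal sketch under stated assumptions: the helper names, the straight-line growth model, and the four-hour horizon are all hypothetical choices for illustration, not something taken from the referenced sources.

```python
def seconds_until_full(samples, capacity_bytes):
    """Estimate seconds until the disk is full.

    `samples` is a time-ordered list of (timestamp_seconds, used_bytes).
    A straight line is drawn through the first and last sample.
    Returns None when usage is not growing.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0 or used1 <= used0:
        return None
    remaining = capacity_bytes - used1
    return remaining * (t1 - t0) / (used1 - used0)


def should_alert(samples, capacity_bytes, horizon_seconds=4 * 3600):
    """Alert if the disk is projected to fill up within the horizon."""
    eta = seconds_until_full(samples, capacity_bytes)
    return eta is not None and eta <= horizon_seconds


if __name__ == "__main__":
    # Usage grew from 50 to 70 units in one hour on a 100-unit disk.
    history = [(0, 50), (3600, 70)]
    print(seconds_until_full(history, 100), should_alert(history, 100))
```

Using only the first and last sample keeps the sketch short; a real alert would likely smooth over more data points to avoid firing on short bursts.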
Status: #💡
References:
- Book - Site Reliability Engineering (Source)
- Video - Practices for Creating Effective Customer SLOs (Source: InfoQ: Stop Talking & Listen; Practices for Creating Effective Customer SLOs)
- The SRE Workbook, chapter 3, has case studies on implementing SLOs