
How we improved on-call life by reducing pager noise

To monitor the health of GitLab.com we use multiple SLIs for each service. We page the on-call engineer when one of these SLIs is not meeting our internal SLOs and is burning through its error budget, with the goal of fixing the problem before too many of our users notice. All of our services' SLIs and SLOs are defined using jsonnet in what we call the metrics-catalog, where we specify each service together with its SLIs and SLOs. For example, the web-pages service has an apdex SLO of 99.5% and multiple SLIs such as the load balancer, the Go server, and the time to write HTTP headers. Because these definitions live in code, we can automatically generate Prometheus recording rules and alerting rules that follow the multi-window, multi-burn-rate approach (a sketch of what such a generated rule can look like is included at the end of this post). Every time an SLI starts burning through its 30-day error budget too fast, we page the SRE on-call to investigate and solve the problem.

This setup has been working well for us for over two years now, but one big pain point remained: service-wide degradations. The SRE on-call was getting paged for every SLI associated with a service or its downstream dependencies, which can mean up to 10 pages per service, since a service has 3-5 SLIs on average and we also have regional and canary SLIs. This is very distracting and stress-inducing, and it keeps the on-call busy acknowledging pages instead of focusing on solving the problem. For example, below we can see the on-call getting paged 11 times in 5 minutes for the same service. What is even worse is a site-wide outage, where the on-call can end up with 50+ pages because all services are in a degraded state. This was a big quality-of-life problem for the on-call and we needed to fix it.

We started researching how best to solve this problem and opened an issue to document all the possible solutions. After some time we decided to go with grouping alerts by service and introducing service dependencies for alerting/paging.

Group alerts by service

The smallest and most effective iteration was to group the alerts by service. Taking the previous example, where the web-pages service paged the on-call 11 times, it should have paged the on-call only once and shown which SLIs were affected. We use Alertmanager for all our alerting logic, and it already has a feature called grouping, so we can group alerts by labels. This is what an alert looks like in our Prometheus setup:

ALERTS{aggregation="regional_component", alert_class="slo_violation", alert_type="symptom", alertname="WebPagesServiceServerApdexSLOViolationRegional", alertstate="firing", component="server", env="gprd", environment="gprd", feature_category="pages", monitor="global", pager="pagerduty", region="us-east1-d", rules_domain="general", severity="s2", sli_type="apdex", slo_alert="yes", stage="main", tier="sv", type="web-pages", user_impacting="yes", window="1h"}

All alerts have the type label attached to them to specify which service they belong to. We can use this label together with the env label to group all the production alerts that are firing for the web-pages service (a sketch of such a grouping configuration is included at the end of this post). We also had to update our PagerDuty and Slack templates to show the right information. Before, we only showed the alert title and description, but this had to change since we now page per service rather than per individual SLO. You can see the changes at runbooks!4684.

This was already a big win! The on-call now gets a page saying "service web-pages" and then the list of SLIs that are […]
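To make the multi-window, multi-burn-rate idea concrete, here is a minimal sketch of what a generated Prometheus alerting rule for the web-pages apdex SLI could look like. The recording rule names, the 14.4x burn-rate threshold, and the label values are illustrative assumptions, not the actual rules generated from our metrics-catalog.

```yaml
groups:
  - name: web-pages-slo-alerts
    rules:
      # Hypothetical multi-window burn-rate alert: page only when the apdex
      # burn rate is high over both a long (1h) and a short (5m) window.
      # With a 99.5% SLO the error budget is 0.5%, so a 14.4x burn rate
      # corresponds to an apdex ratio below 1 - 14.4 * 0.005.
      - alert: WebPagesServiceServerApdexSLOViolation
        expr: |
          (
            gitlab_component_apdex:ratio_1h{env="gprd", type="web-pages", component="server"}
              < (1 - 14.4 * (1 - 0.995))
          )
          and
          (
            gitlab_component_apdex:ratio_5m{env="gprd", type="web-pages", component="server"}
              < (1 - 14.4 * (1 - 0.995))
          )
        for: 2m
        labels:
          severity: s2
          slo_alert: "yes"
          pager: pagerduty
          window: 1h
        annotations:
          title: "The server SLI of the web-pages service is not meeting its apdex SLO"
```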
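And here is a minimal sketch of the kind of Alertmanager routing configuration that groups alerts by environment and service. The receiver name and the timing values are illustrative assumptions, not our production configuration.

```yaml
route:
  receiver: slo-pagerduty   # hypothetical receiver name
  # Group all alerts that share the same environment (env) and service (type)
  # into a single notification, so a service-wide degradation pages once.
  group_by: ['env', 'type']
  group_wait: 30s      # wait briefly so other SLIs of the same service join the page
  group_interval: 5m   # batch newly firing SLIs into the existing notification
  repeat_interval: 8h

receivers:
  - name: slo-pagerduty
    pagerduty_configs:
      - service_key: <pagerduty-integration-key>
```

With group_by, a notification contains every firing alert that shares those label values, so the 11 separate web-pages pages collapse into a single page listing each affected SLI.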
