How we improved on-call life by reducing pager noise

To monitor the health of our services, we use multiple service level
indicators (SLIs) for each service. We then page the on-call when one of these
SLIs is not meeting our internal SLOs and is burning through the error budget,
in the hope of fixing the problem before too many of our users even notice.

All of our services' SLIs and SLOs are defined using jsonnet in what we call
the metrics-catalog, where we specify a service and its SLIs/SLOs. For example,
the web-pages service has an apdex SLO of 99.5% and multiple SLIs such as
loadbalancer, go server, and time to write HTTP headers. Having these in code,
we can automatically generate Prometheus recording rules and alerting rules
that follow the multiple burn rate alerting approach.
Every time we start burning through our 30-day error budget for an SLI too fast
we page the SRE on-call to investigate and solve the problem.
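
To make the burn-rate idea concrete, here is a minimal Python sketch. It assumes the commonly used multi-window scheme in which a page fires when 2% of a 30-day budget would be consumed within 1 hour (a burn rate of 14.4), confirmed by a shorter window. The thresholds, function names, and sample error ratios are illustrative assumptions, not necessarily the rules generated from the metrics-catalog.

```python
PERIOD_HOURS = 30 * 24  # length of the 30-day error budget period


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def burn_rate_threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * PERIOD_HOURS / window_hours


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float) -> bool:
    """Page only if both the long and the short window exceed the threshold."""
    return (burn_rate(long_window_errors, slo) > threshold
            and burn_rate(short_window_errors, slo) > threshold)


SLO = 0.995                              # the 99.5% apdex SLO from the example
FAST_BURN = burn_rate_threshold(0.02, 1)  # assumed: 2% of the budget in 1 hour

# 10% of requests failing in both windows: a sustained problem.
sustained = should_page(0.10, 0.10, SLO, FAST_BURN)
# The short window has recovered to 1% errors: no page.
recovered = should_page(0.10, 0.01, SLO, FAST_BURN)
```

Requiring the short window to agree with the long one keeps the alert from firing long after an incident has already recovered.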

This setup has been working well for us for over two years now, but one big
pain point remained when there was a service-wide degradation. The SRE on-call
was getting paged for every SLI associated with a service or its
downstream dependencies, meaning they could get up to 10 pages per service,
since a service has 3-5 SLIs on average and we also have regional and canary
SLIs. This is very distracting and stress-inducing, and it leaves the on-call
acknowledging pages instead of focusing on solving the problem. For example,
below we can see the on-call getting paged 11 times in 5 minutes for the same
service.

web-pages alert storm

What is even worse is when we have a site-wide outage, where the on-call can
end up getting 50+ pages because all services are in a degraded state.

site wide outage alert storm

This was a big problem for the on-call's quality of life, and we needed to
fix it. We started researching how best to solve the problem and opened an
issue to document all possible solutions. After some time, we decided to go
with grouping alerts by service and introducing service dependencies for
alerting/paging.

Group alerts by service

The smallest and most effective iteration was to group the alerts by
service. Taking the previous example where the web-pages service paged the
on-call 11 times, it should have paged the on-call only once and shown
which SLIs were affected. We use Alertmanager for all our alerting logic, and
it already has a feature called grouping, so we could group alerts by labels.

This is what an alert looks like in our Prometheus setup:

ALERTS{aggregation="regional_component", alert_class="slo_violation", alert_type="symptom", alertname="WebPagesServiceServerApdexSLOViolationRegional", alertstate="firing", component="server", env="gprd", environment="gprd", feature_category="pages", monitor="global", pager="pagerduty", region="us-east1-d", rules_domain="general", severity="s2", sli_type="apdex", slo_alert="yes", stage="main", tier="sv", type="web-pages", user_impacting="yes", window="1h"}

All alerts have the type label attached to them to specify which service they
belong to. We can use this label and the env label to group all the
production alerts that are firing for the web-pages service.

grouping alerts by the `type` and `env` labels
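
The effect of grouping on `env` and `type` can be sketched in a few lines of Python. This only models the idea behind Alertmanager's `group_by` route setting; it is not Alertmanager code, and the sample alerts (including the `puma` component) and the summary format are hypothetical.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("env", "type")):
    """Bucket firing alerts by the values of the group_by labels.

    Simplification: a missing label is treated as an empty string.
    """
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


def page_summary(key, alerts):
    """One page per group, listing the affected SLIs (hypothetical format)."""
    env, service = key
    slis = sorted({f'{a["component"]} ({a["sli_type"]})' for a in alerts})
    return f"service {service} [{env}]: " + ", ".join(slis)


# Hypothetical firing alerts, reduced to the labels that matter here.
firing = [
    {"env": "gprd", "type": "web-pages", "component": "server", "sli_type": "apdex"},
    {"env": "gprd", "type": "web-pages", "component": "loadbalancer", "sli_type": "apdex"},
    {"env": "gprd", "type": "web-pages", "component": "server", "sli_type": "apdex",
     "region": "us-east1-d"},
    {"env": "gprd", "type": "api", "component": "puma", "sli_type": "error"},
]

groups = group_alerts(firing)
pages = [page_summary(key, alerts) for key, alerts in groups.items()]
```

Four firing alerts collapse into two pages here, one for web-pages and one for api, and the regional variant of the server SLI does not create an extra page.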

We also had to update our PagerDuty and Slack templates to show the right
information. Before, we only showed the alert title and description, but this
had to change since we are now alerting by service rather than by one specific
SLO. You can see the changes at runbooks!4684.

Before and after on pages

This was already a big win! The on-call now gets a page saying “service
web-pages” and then the list of SLIs that are burning through the error budget – we went from 11 pages to 1 page!

Service Dependencies

However, we still had the problem that when a downstream service (such as the
database) starts burning through the error budget, it has a cascading effect:
web, git, and api also start burning through their error budgets and page the
on-call for each service. That was the next thing that we had to solve.

We needed some way to not alert on the api service if the patroni
(database) service was burning through the error budget, because if the
database is degraded, the api service will clearly end up degraded as well. We
used another feature of Alertmanager called inhibit rules, where we can tell
Alertmanager to not alert on api if some alerts on patroni are already firing.

visualization of how inhibit rules work
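
The semantics of an inhibit rule can be modeled roughly as follows. In Alertmanager an inhibit rule has source matchers, target matchers, and a list of `equal` labels; the Python below is only a sketch of that behavior with exact-match matchers, and the class and function names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InhibitRule:
    source_match: dict   # labels the already-firing (source) alert must carry
    target_match: dict   # labels the alert to be muted (target) must carry
    equal: tuple = ()    # labels whose values must agree between the two


def matches(labels: dict, matchers: dict) -> bool:
    """Exact-match only, mirroring the 'no regex' restriction."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target: dict, firing: list, rule: InhibitRule) -> bool:
    """True if some firing source alert mutes the target under this rule."""
    if not matches(target, rule.target_match):
        return False
    return any(
        matches(source, rule.source_match)
        # A missing label is treated like an empty value when comparing.
        and all(source.get(l, "") == target.get(l, "") for l in rule.equal)
        for source in firing
    )


# Hypothetical rule: patroni alerts mute api alerts in the same environment.
rule = InhibitRule({"type": "patroni"}, {"type": "api"}, equal=("env",))
firing = [{"type": "patroni", "env": "gprd"}]

api_prod = is_inhibited({"type": "api", "env": "gprd"}, firing, rule)
api_staging = is_inhibited({"type": "api", "env": "gstg"}, firing, rule)
web_prod = is_inhibited({"type": "web", "env": "gprd"}, firing, rule)
```

The `equal` list is what keeps a degraded production database from silencing alerts in another environment.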

As mentioned, all of our SLIs/SLOs are defined inside the metrics-catalog,
so it was a natural fit to add dependencies there, and this is exactly what
we did in runbooks!4710. With this, we can specify that an SLI depends on
another SLI of a different service, which automatically creates inhibit rules
for Alertmanager.

Since inhibit rules could potentially prevent alerting someone, we’ve used
these sparingly. To avoid creating inhibit rules too broadly, we’ve implemented
the following restrictions:

  1. An SLI can’t depend on an SLI of the same service.
  2. The SLI has to exist for that service.
  3. We only allow equal operations, no regex on SLIs.

After that, it was only a matter of adding dependsOn to each service, for example:

  1. web depends on patroni
  2. api depends on patroni
  3. web-pages depends on api

The web-pages inhibit rule shows a chain of dependencies from web-pages ->
api -> patroni, so if patroni is burning through the error budget, it will
not page for the api and web-pages services anymore!
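
One way to reason about such a chain is to walk the dependency map and page only for degraded services that have no degraded service further down the chain. The Python below is a sketch of that reasoning, not of how the generated Alertmanager rules are evaluated; treating dependencies transitively is an assumption made for this example, and the function names are hypothetical.

```python
def transitive_dependencies(service: str, depends_on: dict) -> set:
    """All services reachable from `service` by following dependsOn edges."""
    seen = set()
    stack = list(depends_on.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return seen


def services_to_page(degraded: set, depends_on: dict) -> set:
    """Degraded services whose dependency chain contains no degraded service."""
    return {
        service for service in degraded
        if not (transitive_dependencies(service, depends_on) & degraded)
    }


# The dependencies listed above.
depends_on = {
    "web": ["patroni"],
    "api": ["patroni"],
    "web-pages": ["api"],
}

database_outage = services_to_page({"patroni", "api", "web", "web-pages"}, depends_on)
pages_only = services_to_page({"web-pages"}, depends_on)
```

With the database degraded, only patroni pages; when web-pages is degraded on its own, it still pages as before.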

How it’s working

We have been using alert grouping and service dependencies for over a month now, and we have already seen some improvements:

  1. The on-call only gets paged once per service.
  2. When there is a large site-wide outage they only get paged 5-10 times since we have external probes that also alert us.
  3. There is an overall downward trend on pages for the on-call as seen below.

pages trend

Cover image by Yaoqi on Unsplash
