Prometheus Alert Debouncing

Usually Prometheus alerts look like this:

- alert: MetricIsCritical
  for: 5m
  expr: METRIC > 100

However, it can be quite tricky to create an alert rule for a noisy metric:

flapping.svg

A situation like that isn't desirable, because alerts often create incidents automatically, and in this case you will get two notifications instead of just one. If the metric keeps flipping every few minutes, then, too bad, you'll be receiving alerts every few minutes until you fix the problem. This can be very annoying. Ideally, there should be just a single alert.

Historically, Prometheus gave us two knobs to play with: the threshold and for. Here is what we can do:

  • Increase the threshold. This doesn't really help: either the threshold becomes too high, so that it never triggers an alert (pic. 1), or it doesn't really change the situation and we still get several alerts (pic. 2).
  • Increase for. for delays alert triggering for some time, e.g. for: 5m ensures that the metric has been in a critical state for 5 minutes continuously and only then triggers an alert. This doesn't always help either: if it's too high, it won't trigger anything (pic. 3), otherwise we still get several alerts (pic. 4). Both attempts are sketched after the figure below.
  • The best option would be to reduce the threshold, but then the rule might become too sensitive, i.e. it will pick up situations that we don't consider an incident (pic. 5).
tuning.svg
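
For reference, the tuning attempts from pic. 1–4 boil down to changes like these (the values are purely illustrative):

- alert: MetricIsCritical
  for: 15m            # a longer "for" (pic. 3 and 4)
  expr: METRIC > 200  # a higher threshold (pic. 1 and 2)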

That is, both the threshold and for affect sensitivity, not debouncing. So we need something else: the recently added keep_firing_for field (available since Prometheus 2.42) or hysteresis.

keep_firing_for keeps an alert in the firing state for the specified time, and if the condition is met again during this time, no new alert is created. Pic. 6 shows an example for the alert below:

- alert: MetricIsCritical
  keep_firing_for: 5m
  expr: METRIC > 100
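
If needed, keep_firing_for can be combined with for, so the alert both waits before it starts firing and lingers before it resolves; a minimal sketch with illustrative durations:

- alert: MetricIsCritical
  for: 5m              # must be critical for 5 minutes before firing
  keep_firing_for: 10m # stays firing for 10 minutes after the condition clears
  expr: METRIC > 100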

The idea of hysteresis in monitoring is to have two thresholds instead of one: an alert is triggered when the first threshold is crossed, but it will be resolved only when the second threshold is crossed:

debounce.svg

Unfortunately, Prometheus doesn't support hysteresis syntactically, but there are plans to add it. For now, it can be done using a self-referencing alert expression:

- alert: MetricIsCritical
  expr: |
    METRIC > 100
    or (
        METRIC > 50
        and on()
        ALERTS{alertname="MetricIsCritical",alertstate="firing"}
    )

First of all, an alert (just like a recording rule) produces a metric: in this case, ALERTS{alertname="MetricIsCritical",alertstate="firing"} keeps the whole history of when this alert was firing. This timeseries has a sample with value 1 whenever the alert was firing; otherwise there is no sample at all.
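
You can check this in the expression browser; while the alert is active, the query returns something roughly like this (the value is always 1):

ALERTS{alertname="MetricIsCritical"}
  => ALERTS{alertname="MetricIsCritical", alertstate="firing"}   1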

Now, nothing stops us from referencing this metric: when Prometheus evaluates expr, it takes the latest available ALERTS{...} sample, computes the rest of the expression, and only afterwards adds a new sample to ALERTS{...}.

All that's left is technicalities: use logical operators to shape the condition under which the alert should fire. In this case, the alert starts firing when METRIC crosses 100, and it stops firing only when METRIC drops below 50.
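
The same pattern works in the opposite direction too; here is a minimal sketch (the alert name and thresholds are made up) of a "too low" alert that starts firing below 50 and resolves only above 100:

- alert: MetricIsTooLow
  expr: |
    METRIC < 50
    or (
        METRIC < 100
        and on()
        ALERTS{alertname="MetricIsTooLow",alertstate="firing"}
    )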

Here is a real-world example of how disk usage might be monitored:

- record: persistentvolumeclaim:kubelet_volume_utilization:ratio
  expr: |
    kubelet_volume_stats_used_bytes
    / on (namespace, persistentvolumeclaim)
    kubelet_volume_stats_capacity_bytes

- alert: DiskIsNearlyFull
  expr: |
    persistentvolumeclaim:kubelet_volume_utilization:ratio > 0.9
    or (
        persistentvolumeclaim:kubelet_volume_utilization:ratio > 0.8
        and on(namespace, persistentvolumeclaim)
        ALERTS{alertname="DiskIsNearlyFull",alertstate="firing"}
    )
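
Note that the and on(namespace, persistentvolumeclaim) part is what makes the self-reference work: the ALERTS series carries the alert's own labels plus alertname and alertstate, so matching has to be restricted to the labels both sides share. While the alert fires for some volume, the series looks roughly like this (label values are illustrative):

ALERTS{alertname="DiskIsNearlyFull", alertstate="firing", namespace="default", persistentvolumeclaim="data-db-0"}   1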

This technique is very powerful yet fragile: dealing with binary operators and label matching can be very tricky. Here is my opinion on when to use each:

  • keep_firing_for should probably be the default choice if you have some kind of oscillating metric, like metrics around GC, cache flushes, or DB compaction. In this case, keep_firing_for should be set to at least one oscillation period (see the sketch after this list).
  • Hysteresis should be used with care, when keep_firing_for is not enough. For example, it might not be practical to set a very long keep_firing_for duration: if it would have to be longer than, say, 30 minutes, it probably makes sense to use the hysteresis technique instead. The same goes for a somewhat volatile metric where we're looking for an initial "boom" followed by trailing values, similar to what is shown in pic. 7.
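
For instance, if the metric dips during a compaction cycle that runs roughly every 10 minutes, keep_firing_for should cover at least one full cycle; a minimal sketch with illustrative values:

- alert: MetricIsCritical
  keep_firing_for: 15m  # longer than one ~10m oscillation period
  expr: METRIC > 100
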
—pk,  #prometheus