Prometheus Daily Historic Average

It's quite useful to have a typical daily profile in your monitoring dashboards. For example, it can be a good idea to check how much CPU or memory is used after a deployment compared with how it was in the past.

A simple way would be to plot today's and yesterday's values on the same panel. That can be done with two queries:

  • sum(rate(container_cpu_usage_seconds_total[1m]))
  • sum(rate(container_cpu_usage_seconds_total[1m] offset 24h))

This works, but yesterday's values can be too noisy, and if some incident happened yesterday, it will be very visible in today's comparison, which is not ideal. avg_over_time can help with smoothing, but smoothing over time is undesirable if you have peaks or dips at certain minutes of the day: maybe you have a cronjob, or maybe a partner calls your API on a schedule. So instead, for every minute of the day, you can compute the average of that minute over the previous few days. Here are the recording rules for that:

- record: :container_cpu_usage_seconds_total:rate1m
  expr: |
    sum(rate(container_cpu_usage_seconds_total[1m]))

# Average CPU at the same minute of the day over the last 7 days
- record: :container_cpu_usage_seconds_total:rate1m_avg7d
  expr: |
    (
      :container_cpu_usage_seconds_total:rate1m +
      :container_cpu_usage_seconds_total:rate1m offset 1d +
      :container_cpu_usage_seconds_total:rate1m offset 2d +
      :container_cpu_usage_seconds_total:rate1m offset 3d +
      :container_cpu_usage_seconds_total:rate1m offset 4d +
      :container_cpu_usage_seconds_total:rate1m offset 5d +
      :container_cpu_usage_seconds_total:rate1m offset 6d
    ) / 7

# Dashboard queries:
# :container_cpu_usage_seconds_total:rate1m
# :container_cpu_usage_seconds_total:rate1m_avg7d offset 24h
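
To see why averaging the same minute across days differs from smoothing over time, here is a small Python sketch. It is not how Prometheus computes anything: the data is made up (a flat baseline, a hypothetical cronjob spike at 03:00, and a one-off incident on one day), and all function names are illustrative only.

```python
from statistics import mean

MINUTES_PER_DAY = 1440
SPIKE_MINUTE = 180     # hypothetical cronjob at 03:00
INCIDENT_MINUTE = 600  # hypothetical one-off incident at 10:00


def make_day(baseline=1.0, spike=5.0):
    """One day of per-minute values: flat baseline plus a one-minute spike."""
    day = [baseline] * MINUTES_PER_DAY
    day[SPIKE_MINUTE] = spike
    return day


def smooth_over_time(day, window=30):
    """Trailing mean over `window` minutes, similar in spirit to avg_over_time."""
    return [mean(day[max(0, i - window + 1): i + 1]) for i in range(len(day))]


def same_minute_average(days):
    """For each minute of the day, average that minute across all given days."""
    return [mean(values) for values in zip(*days)]


days = [make_day() for _ in range(7)]
days[3][INCIDENT_MINUTE] = 9.0  # incident on one day only

time_smoothed = smooth_over_time(days[-1])
profile = same_minute_average(days)

print(round(time_smoothed[SPIKE_MINUTE], 3))  # 1.133: the spike is flattened
print(round(profile[SPIKE_MINUTE], 3))        # 5.0: the spike is preserved
print(round(profile[INCIDENT_MINUTE], 3))     # 2.143: the incident is diluted
```

The trailing window flattens the recurring spike to nearly the baseline, while the same-minute average keeps it at full height and reduces the one-off incident to a seventh of its excess over the baseline.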

The rate1m_avg7d rule gives you a weekly average for every minute of the day. It is, however, problematic: the expression is quite long, and it takes 7 days before the dashboard shows anything (6 days of history for the rule itself, plus one more for offset 24h), given that both recording rules are created at the same time. You could chain a bunch of or's to fall back on the current value while history is missing, but that would complicate the query even more. Instead, you can use exponential smoothing:

- record: :container_cpu_usage_seconds_total:rate1m
  expr: |
    sum(rate(container_cpu_usage_seconds_total[1m]))

# Smoothed CPU at the same time of the day over the last few days
- record: :container_cpu_usage_seconds_total:rate1m_mavg
  expr: |
    (
      1 * :container_cpu_usage_seconds_total:rate1m +
      6 * :container_cpu_usage_seconds_total:rate1m_mavg offset 1d
    ) / 7 or :container_cpu_usage_seconds_total:rate1m

# Dashboard queries:
# :container_cpu_usage_seconds_total:rate1m
# :container_cpu_usage_seconds_total:rate1m_mavg offset 24h

This one is much faster to compute than a true moving average. It takes exactly one day for metric values to appear on the dashboards (assuming the use of offset 24h). Another benefit is that this scheme works with very short Prometheus retention: even if you have just 2 days of retention, it will still produce a smooth curve comparable to a 7-day moving average.
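
The behaviour of the smoothing rule can be checked with a short Python sketch. It mimics the recurrence for a single minute of the day, assuming the rule is evaluated once per day for that minute. The sample values and function names are invented for illustration.

```python
def update(current, previous_mavg):
    """One evaluation of the smoothing rule for a single minute of the day.

    Mirrors `(1 * current + 6 * mavg offset 1d) / 7 or current`:
    when no smoothed value from a day ago exists, fall back to the current one.
    """
    if previous_mavg is None:
        return current
    return (1 * current + 6 * previous_mavg) / 7


def run(daily_values):
    """Apply the rule day after day; return the smoothed value for each day."""
    mavg = None
    history = []
    for value in daily_values:
        mavg = update(value, mavg)
        history.append(mavg)
    return history


# Hypothetical readings for the same minute on consecutive days,
# with a one-day incident (10.0) on the fourth day.
values = [2.0, 2.0, 2.0, 10.0, 2.0, 2.0, 2.0]
smoothed = run(values)

# Steady-state weight of the value from k days ago: (1/7) * (6/7)**k
weights = [(1 / 7) * (6 / 7) ** k for k in range(7)]

print(smoothed[0])                # 2.0: the fallback gives a value on day one
print(round(smoothed[3], 3))      # 3.143: the incident is damped to 22/7
print(round(sum(weights), 3))     # 0.66: share of weight on the last 7 days
```

On the first day the fallback returns the raw value, which is why data shows up without a long warm-up. Each step needs only the previous day's smoothed value, which is why short retention is enough. In steady state the most recent 7 days carry about 66% of the total weight, and older days fade out geometrically.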

—pk,  #prometheus