Correctly detecting a change in a Prometheus count metric
I have tried to write a PromQL query for detecting a change in a count metric.
My scrape interval is 30 seconds.
I query the metric like this:
http_server_requests_seconds_count{outcome!="REDIRECTION",outcome!="SUCCESS"}
It shows how many of all http_server_requests were not redirects and not successful.
My attempt at writing an alert expression using this metric looks like this:
sum by(service, method, outcome, status, uri) (
rate(
http_server_requests_seconds_count{
outcome!="REDIRECTION",
outcome!="SUCCESS"
}[1m]
)
) * 60
My thinking is that the [1m] rate multiplied by 60 seconds would be 1 when a single change occurs, but as far as I can tell I get 2.
These graphs show this clearly:
The top graph is the sum expression, and the bottom graph is the change in server request count. When the bottom graph increases by 1, the top graph should temporarily go up to 1 as well (but it actually goes up to 2).
What am I doing wrong? Did I misunderstand something? How can I write a query that gives me the value 1 when a change occurs? Should I expect to be able to write such a query?
Thanks!
Solution 1:[1]
That's because Prometheus prioritizes a consistent definition of what a range is over accuracy. I.e. it always defines a range as all the samples falling within the (inclusive) interval [now() - range, now()]. This definition makes perfect sense for gauges: if you want to compute an avg_over_time() with a time range equal to the step, you want every input sample included in the calculation of exactly one output sample.
But the same is not true for counters. With a time range equal to the step, one input value (i.e. the increase between two successive samples) is essentially discarded. (See Prometheus issues #3746 and #3806 for A LOT more detail.) To make up for the data it throws away, Prometheus uses extrapolation to adjust the result of the calculation.
Meaning that if (as in your case) you use a time range that's 2x your scrape interval (1m range for 30s scrape interval), Prometheus will (on average) find 2 samples in each range, but the actual time range covered by those 2 samples will be around 30s. So Prometheus will helpfully extrapolate the rate to the requested 1m by doubling the value. Hence the result of 2 instead of the expected 1. You'll also notice that because some increases between successive samples are discarded (even though no samples are), not all increases in your counter show up in your rate() graph. (I.e. there is no jump in rate() corresponding to the third counter increase. If you refresh at different times, different increases will appear and disappear. Grafana has "solved" the latter by always aligning requested ranges with the step and thus consistently missing the same increases.)
The solution suggested by Prometheus developers is to compute rates over longer durations. But all that does is reduce the error (you'd get 1.5 with a 3x range, 1.33 with a 4x range, 1.25 with a 5x range, etc.), never getting rid of it. Prometheus' extrapolation is hidden well enough by smoothly increasing counters, but stands out like a sore thumb with counters like your own, that rarely increase.
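To see where the factor of 2 (and the 1.5 for a 3x range) comes from, here is a simplified sketch of the extrapolation logic described above. It is modeled on the behavior of Prometheus' rate()/increase() (the real implementation lives in promql/functions.go); counter-reset handling is omitted, and the sample timestamps and values are invented to mirror the scenario in this question: a 30s scrape interval and a counter that increases by 1 exactly once.

```python
# Simplified model of Prometheus' extrapolated increase over a range.
# Counter resets are ignored; timestamps/values are illustrative only.

def extrapolated_increase(samples, range_start, range_end):
    """samples: list of (timestamp_seconds, counter_value) inside the range."""
    (first_t, first_v), (last_t, last_v) = samples[0], samples[-1]
    increase = last_v - first_v                    # raw increase, no resets
    sampled_interval = last_t - first_t            # time the samples cover
    avg_between = sampled_interval / (len(samples) - 1)
    threshold = avg_between * 1.1                  # extrapolation cutoff

    extrapolate_to = sampled_interval
    for gap in (first_t - range_start, range_end - last_t):
        # Extend toward each range boundary, but never past the cutoff.
        extrapolate_to += gap if gap < threshold else avg_between / 2
    return increase * extrapolate_to / sampled_interval

# 1m range, 30s scrape: two samples covering only ~30s get doubled.
print(extrapolated_increase([(15, 0), (45, 1)], 0, 60))           # 2.0
# 90s range (3x the scrape interval): the error shrinks to 1.5.
print(extrapolated_increase([(15, 0), (45, 1), (75, 1)], 0, 90))  # 1.5
```

The per-second rate() is this value divided by the range, so the question's rate(...[1m]) * 60 reports exactly the 2.0 shown here.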
The only workaround for this issue (short of fixing Prometheus, for which I've submitted a PR and am maintaining a fork) is to reverse engineer Prometheus' implementation of rate(). I.e. assuming a scraping interval of 30s an expression like rate(foo[1m]) should be replaced with:
rate(foo[90s]) * 60 / 90
or, more generally (note that the duration inside the brackets must be a time literal; it can't be a calculation):
rate(foo[intended_range + scrape_interval]) * intended_range / (intended_range + scrape_interval)
The reason why this works is that the intended_range + scrape_interval range will give you enough samples to cover the increases over intended_range, which is what you want. But then you have to undo the change introduced by Prometheus' extrapolation, hence the multiplication and division that follow. It is an ugly hack and depends on you knowing your scrape interval and hardcoding it into your recording rules and/or Grafana queries.
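Applied to the question's metric (assuming the 30s scrape interval used in the example above and an intended range of 1m), the full alert expression would look something like:

```promql
sum by(service, method, outcome, status, uri) (
  rate(
    http_server_requests_seconds_count{
      outcome!="REDIRECTION",
      outcome!="SUCCESS"
    }[90s]
  ) * 60 / 90
) * 60
```

Both the [90s] literal and the 60 / 90 factor hard-code the scrape interval, so they have to be updated by hand if it ever changes.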
Do note that whatever method you use, you will likely not get a value of exactly 1. Because of service, network and internal Prometheus latency, samples will usually not be aligned on the millisecond, so the rate of increase per second will be slightly below or slightly above the expected value.
Solution 2:[2]
Here is an alternative that computes the correct change for counter metrics:
max_over_time(http_server_requests_seconds_count{outcome!="REDIRECTION",outcome!="SUCCESS"}[1m])
- min_over_time(http_server_requests_seconds_count{outcome!="REDIRECTION",outcome!="SUCCESS"}[1m])
Another thing that took me a while to figure out: when plotting the above, make sure that the resolution (step) of the plot is not larger than the scrape interval. Otherwise some or all of the spikes you are expecting to see will not be shown.
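The reason this solution reports the raw change is that it compares actual samples in the window, so no extrapolation is involved. A toy illustration (not Prometheus code; sample values invented):

```python
# max_over_time(...) - min_over_time(...) over one window, in miniature:
# the difference between the largest and smallest sample actually seen.

def max_minus_min(window_samples):
    return max(window_samples) - min(window_samples)

# Counter samples inside a 1m window, 30s scrape: one increase of 1.
print(max_minus_min([41, 42]))  # 1 -- exactly the observed change
```

One caveat: because counters drop back toward zero on restart, this expression reports a bogus value for any window that spans a counter reset (e.g. samples [41, 0] would yield 41), whereas rate() and increase() compensate for resets.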
Solution 3:[3]
You need to use the changes() function. The following query returns non-zero values if matching counters changed during the last minute (see the [1m] lookbehind window in the query):
changes(http_server_requests_seconds_count{outcome!="REDIRECTION",outcome!="SUCCESS"}[1m])
As for the unexpected rate() and increase() results - please read this comment and this article.
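What changes() counts is the number of times the value differs between consecutive samples in the lookbehind window, which is why it sidesteps the extrapolation issue entirely. A toy model (not Prometheus source; sample values invented):

```python
# changes(metric[1m]), in miniature: count how many consecutive sample
# pairs inside the window have different values.

def changes(window_samples):
    return sum(1 for a, b in zip(window_samples, window_samples[1:]) if a != b)

print(changes([41, 41, 42, 42]))  # 1 -- one change inside the window
print(changes([5, 5, 5]))         # 0 -- no change, no alert
```

Note that for a counter this counts distinct increases in the window, not the total increment, so it suits "did anything change?" alerts rather than volume measurements.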
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Alin Sînpălean |
| Solution 2 | David |
| Solution 3 | valyala |

