http_request_duration_seconds_sum / http_request_duration_seconds_count shows 2 graphs
I have a Grafana dashboard where I try to plot some Prometheus metrics.
I am trying to calculate the average response time for 2 endpoints using the formula:
http_request_duration_seconds_sum / http_request_duration_seconds_count
but when plotting the query into the Grafana graph panel, I get 4 graphs (2 for each) instead of only 2, which I don't understand.
Can anyone tell me why there are 4 curves instead of 2? The two at the top are from the same query, and likewise for the two at the bottom.
UPDATE
Is adding
sum(rate(http_request_duration_sum[24h])) / sum(rate(http_request_duration_count[24h]))
the answer? That gives me 2 curves instead of 4, but not sure if the result is what I am looking for (being the average response time for the endpoint).
Solution 1:[1]
I found out that the following query:
sum(rate(http_request_duration_sum[24h])) / sum(rate(http_request_duration_count[24h]))
is the answer I am looking for, giving me the average response time in seconds and only 1 curve per query. Note that the range selector in square brackets belongs inside rate(), directly after the metric name.
Of course the range window should not be 24h, so I've set it to [1m] instead, following another Stack Overflow answer.
Solution 2:[2]
Yes, those Prometheus metrics are counters, so you should wrap them in rate() or irate(). Use irate() for volatile, fast-moving metrics.
Solution 3:[3]
The http_request_duration_sum and http_request_duration_count are metrics of counter type, so they usually increase over time and may sometimes reset to zero (for instance when the service, which exposes these metrics, is restarted):
- The http_request_duration_sum metric shows the sum of all the request durations since the last service restart.
- The http_request_duration_count metric shows the total number of requests since the last service restart.
So http_request_duration_sum / http_request_duration_count gives the average request duration since the service start. This value is rarely useful, since it smooths out request duration spikes and the smoothing gets stronger the longer the service runs. Usually people want to see the average request duration over the last N minutes. This can be calculated by wrapping the counters in the increase() function with the needed lookbehind duration in square brackets. For example, the following query returns the average request duration over the last 5 minutes (see 5m in square brackets):
increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
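To make the difference between the lifetime average and a windowed average concrete, here is a minimal Python sketch. The counter samples are made up for illustration, and the arithmetic is a simplification: the real increase() function also extrapolates to the window boundaries and compensates for counter resets, which this sketch does not attempt.

```python
# Hypothetical cumulative counter samples, one per minute (not real data).
# Each tuple is (duration_sum_in_seconds, request_count) since service start.
samples = [
    (0.0, 0),
    (10.0, 100),   # 100 requests at 0.1 s each
    (20.0, 200),
    (30.0, 300),
    (40.0, 400),
    (90.0, 500),   # spike: 100 requests at 0.5 s each
]


def lifetime_average(samples):
    """Analogue of sum / count on the raw counters."""
    total_sum, total_count = samples[-1]
    return total_sum / total_count


def windowed_average(samples, window):
    """Simplified analogue of increase(sum[w]) / increase(count[w]).

    `window` is the number of sample intervals to look back.
    """
    end_sum, end_count = samples[-1]
    start_sum, start_count = samples[-1 - window]
    return (end_sum - start_sum) / (end_count - start_count)


lifetime = lifetime_average(samples)
recent = windowed_average(samples, 1)
print(f"lifetime average: {lifetime:.2f} s, last-minute average: {recent:.2f} s")
```

With these made-up numbers the spike to 0.5 s is clearly visible in the windowed value, while the lifetime average only moves to 0.18 s.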
This query may return multiple time series if the http_request_duration metric is exposed by multiple apps (aka jobs) or nodes (aka instances or scrape targets). If you need the average request duration over the last 5 minutes per job, then the sum function must be used:
sum(increase(http_request_duration_sum[5m])) by (job)
/
sum(increase(http_request_duration_count[5m])) by (job)
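The effect of the aggregation on the number of returned series can be sketched in Python. This is only an illustration under assumed data: the job and instance names and the values are hypothetical, and the dictionaries stand in for the per-label-set results of increase(...[5m]).

```python
from collections import defaultdict

# Hypothetical 5-minute increases keyed by (job, instance) label sets.
sum_increase = {
    ("api", "host-1"): 2.0,
    ("api", "host-2"): 6.0,
    ("web", "host-1"): 1.0,
    ("web", "host-2"): 3.0,
}
count_increase = {
    ("api", "host-1"): 20.0,
    ("api", "host-2"): 20.0,
    ("web", "host-1"): 10.0,
    ("web", "host-2"): 10.0,
}

# Plain division pairs series with identical label sets,
# so it yields one result per (job, instance) combination.
per_series = {
    labels: sum_increase[labels] / count_increase[labels]
    for labels in sum_increase
}


def sum_by_job(series):
    """Analogue of sum(...) by (job): drop the instance label and add up."""
    totals = defaultdict(float)
    for (job, _instance), value in series.items():
        totals[job] += value
    return dict(totals)


# Aggregate each side first, then divide: one result per job.
sums = sum_by_job(sum_increase)
counts = sum_by_job(count_increase)
per_job = {job: sums[job] / counts[job] for job in sums}

print(len(per_series), "series without aggregation")
print(len(per_job), "series with sum by (job)")
```

Two jobs on two instances give four curves without aggregation and two curves with it, which would match the 4-versus-2 behaviour described in the question if the endpoints there are scraped from two targets.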
Note that the sum(...) by (job) is applied individually to the left and the right part of /. This isn't equivalent to the following incorrect queries:
sum(
  increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
) by (job)

avg(
  increase(http_request_duration_sum[5m]) / increase(http_request_duration_count[5m])
) by (job)
The first incorrect query calculates the sum of the per-series average response times within each job, while the second calculates the average of those averages. Neither is what most users expect, because a series that served only a few requests then counts as much as one that served thousands - see this answer for details.
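A small worked example shows how far the three calculations can diverge. The numbers below are invented for illustration: two instances of one job, one nearly idle and fast, the other busy and slow.

```python
# Hypothetical 5-minute increases for two instances of the same job.
instances = [
    {"duration_sum": 1.0, "count": 10},       # 10 requests, 0.1 s average
    {"duration_sum": 900.0, "count": 1000},   # 1000 requests, 0.9 s average
]

# Correct: sum(increase(sum)) / sum(increase(count))
ratio_of_sums = (
    sum(i["duration_sum"] for i in instances)
    / sum(i["count"] for i in instances)
)

# Incorrect variants: divide per instance first, then aggregate.
per_instance = [i["duration_sum"] / i["count"] for i in instances]
sum_of_averages = sum(per_instance)
average_of_averages = sum_of_averages / len(per_instance)

print(f"ratio of sums:       {ratio_of_sums:.3f} s")
print(f"sum of averages:     {sum_of_averages:.3f} s")
print(f"average of averages: {average_of_averages:.3f} s")
```

Here the true average over all 1010 requests is about 0.892 s. The average of averages reports 0.5 s because the idle instance gets the same weight as the busy one, and the sum of averages (1.0 s) is not an average at all.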
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | badaboomskey |
| Solution 2 | Rishindra Kumar |
| Solution 3 | valyala |

