How to monitor latency (a.k.a. response time) in a high-throughput application? And what is the correct way to use, for example, OpenTSDB for such a need?
Our system currently uses Prometheus to monitor service latency. But the underlying mechanism, a bucket-based counter (histogram), produces inaccurate p99 latency, and we have observed this in a few cases:
- when QPS rises from 1k to 10k, the reported p99 latency drops from 9ms to 6ms, whereas we would expect it to go up;
- when the same latency is measured on both the upstream and the downstream, their difference should ideally not exceed 2*RTT, hopefully staying under 4ms; yet a real example shows a 26ms difference.
We have figured out why these cases happen: it comes down to the buckets we chose. But if we want a more accurate way to compute p99 latency, we should not have to depend on "choosing the buckets correctly". So we are looking for monitoring infrastructure that can do this in a smarter way.
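To make the failure mode concrete, here is a small self-contained Python sketch of Prometheus-style quantile estimation (linear interpolation inside the bucket where the target rank falls, assuming values are spread uniformly in the bucket). The bucket bounds and the Gaussian latency distribution are invented for illustration; the point is that when real latencies cluster near the bottom of a wide bucket, the interpolated p99 lands far from the true p99:

```python
import random

def bucket_quantile(q, bounds, counts):
    """Prometheus-style estimate: linearly interpolate inside the
    bucket where the q-th rank falls (assumes a uniform spread of
    observations within each bucket)."""
    total = sum(counts)
    rank = q * total
    cum, lower = 0.0, 0.0
    for upper, c in zip(bounds, counts):
        if cum + c >= rank:
            return lower + (upper - lower) * (rank - cum) / c
        cum += c
        lower = upper
    return bounds[-1]

random.seed(42)
# hypothetical workload: latencies clustered around 5ms
latencies = [random.gauss(5.0, 1.0) for _ in range(10000)]

bounds = [1, 2.5, 5, 10, 25]  # made-up bucket upper bounds, in ms
counts = [0] * len(bounds)
for v in latencies:
    for i, b in enumerate(bounds):
        if v <= b:
            counts[i] += 1
            break

true_p99 = sorted(latencies)[int(0.99 * len(latencies))]
est_p99 = bucket_quantile(0.99, bounds, counts)
print("true p99:", round(true_p99, 2), "ms  bucket estimate:",
      round(est_p99, 2), "ms")
```

With these buckets the estimate lands near the top of the wide (5, 10] bucket even though the true p99 is around 7ms, and the size of that error shifts as traffic (and thus the in-bucket distribution) changes, which is consistent with the counter-intuitive movements described above.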
From my research, t-digest looks like a good algorithm for this task. But I wonder whether anyone has used it with Prometheus, or with a system like OpenTSDB; I cannot find any real-world use case on the Internet.
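For intuition about why t-digest avoids the bucket problem, below is a toy centroid sketch in pure Python. It is NOT a real t-digest (a real one uses a scale function to keep centroids near the tails very small, which is what makes extreme quantiles accurate); it only illustrates the two properties that matter here: bounded memory without pre-chosen buckets, and mergeability across hosts:

```python
import bisect
import random

class ToyDigest:
    """Toy centroid sketch illustrating the t-digest idea: keep a
    sorted list of (mean, count) centroids and merge the closest
    neighbours to bound memory. Demo only, not production code."""

    def __init__(self, max_centroids=100):
        self.max_centroids = max_centroids
        self.centroids = []  # sorted list of [mean, count]

    def add(self, x, weight=1):
        bisect.insort(self.centroids, [x, weight])
        if len(self.centroids) > self.max_centroids:
            self._compress()

    def _compress(self):
        # merge the adjacent pair whose means are closest together
        i = min(range(len(self.centroids) - 1),
                key=lambda j: self.centroids[j + 1][0] - self.centroids[j][0])
        (m1, c1), (m2, c2) = self.centroids[i], self.centroids[i + 1]
        self.centroids[i:i + 2] = [[(m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2]]

    def merge(self, other):
        # digests merge losslessly enough to give a global quantile,
        # unlike per-host percentiles, which cannot be averaged
        for mean, count in other.centroids:
            self.add(mean, count)

    def quantile(self, q):
        total = sum(c for _, c in self.centroids)
        rank, cum = q * total, 0.0
        for mean, count in self.centroids:
            cum += count
            if cum >= rank:
                return mean
        return self.centroids[-1][0]

random.seed(1)
data = [random.gauss(5.0, 1.0) for _ in range(10000)]  # latencies in ms

# two hosts each build a digest; merge them for a global p99
d1, d2 = ToyDigest(), ToyDigest()
for x in data[:5000]:
    d1.add(x)
for x in data[5000:]:
    d2.add(x)
d1.merge(d2)

true_p99 = sorted(data)[int(0.99 * len(data))]
print("digest p99:", round(d1.quantile(0.99), 2),
      "true p99:", round(true_p99, 2))
```

Even this crude version tracks the true p99 far better than coarse fixed buckets, because the sketch adapts its "buckets" (centroids) to the data instead of requiring them up front.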
In the OpenTSDB case, what if 2, 10, or more requests report metrics at the same timestamp (in seconds)? Wouldn't that cause a conflict? Should we pre-aggregate before reporting the metrics to OpenTSDB?
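One common answer to the timestamp-collision concern is exactly that: pre-aggregate in the reporting process, so each (metric, tag set) emits at most one data point per second. Here is a minimal sketch of that idea; the buffer layout, metric names, tags, and `flush` contract are all hypothetical, not part of any OpenTSDB client API:

```python
import time
from collections import defaultdict

# hypothetical in-process buffer: raw samples keyed by (metric, second)
_buffer = defaultdict(list)

def record(metric, latency_ms, now=None):
    """Buffer one observation under its wall-clock second."""
    sec = int(now if now is not None else time.time())
    _buffer[(metric, sec)].append(latency_ms)

def flush(sec):
    """Emit exactly one pre-aggregated point per metric for the given
    second, so the backend never receives two values for the same
    (metric, tags, timestamp) combination."""
    points = []
    for (metric, s), vals in list(_buffer.items()):
        if s != sec:
            continue
        vals.sort()
        idx = min(len(vals) - 1, int(0.99 * len(vals)))
        points.append({"metric": metric + ".p99", "timestamp": s,
                       "value": vals[idx], "tags": {"host": "web01"}})
        del _buffer[(metric, s)]
    return points

# 100 requests land in the same second; only one point is reported
for v in range(1, 101):
    record("svc.latency", float(v), now=1700000000)
points = flush(1700000000)
print(points)
```

The caveat is that a per-host p99 stored as a plain scalar cannot be re-aggregated into a fleet-wide p99 later; if you need cross-host percentiles, ship a mergeable sketch (such as t-digest centroids) to a central aggregator and only write the final quantiles to the TSDB, and tag each point with its host so points from different hosts never collide on timestamp.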
Can anyone share a real-world example of either setup?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
