Calculating live elapsed time using Prometheus

We have around 16k batch jobs that run on a regular basis. Each job has a name, and each daily run of these 16k jobs has a run-id.

Since these jobs take a good amount of time to finish, I want a live timer in Grafana that tells me how long a job has been running, e.g. now() - 'start-time of job', or, if the job has completed, end-time - start-time of job.

Our infrastructure is mainly Prometheus & Grafana. At first, I had the following idea of heartbeats (all abstract; I am finding it hard to map it onto Prometheus & Grafana):

On job start, emit status=1 (gauge) (counter will increment)
On job end, emit status=2 (gauge)

Now the elapsed time in pseudocode would be

(get(status=2).map(timestamps).min or now()) - get(status=1).map(timestamps).min

Assuming get returns a vector of events of the form [status=<x>, timestamp]
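
To make the pseudocode concrete, here is a minimal Python sketch of the same calculation. It only models the arithmetic and does not talk to Prometheus. The names elapsed_seconds, STARTED and FINISHED are hypothetical, and I am assuming each event is a (status, unix-timestamp) pair for a single job run.

```python
import time
from typing import Iterable, Optional, Tuple

STARTED = 1   # hypothetical status value emitted on job start
FINISHED = 2  # hypothetical status value emitted on job end

Event = Tuple[int, float]  # (status, unix timestamp in seconds)


def elapsed_seconds(events: Iterable[Event], now: Optional[float] = None) -> Optional[float]:
    """Return the elapsed seconds for one job run, or None if it never started."""
    events = list(events)
    starts = [ts for status, ts in events if status == STARTED]
    ends = [ts for status, ts in events if status == FINISHED]
    if not starts:
        return None
    start = min(starts)
    # Use the earliest end event if the job finished, otherwise the current time.
    end = min(ends) if ends else (time.time() if now is None else now)
    return end - start


running = [(STARTED, 1000.0)]
finished = [(STARTED, 1000.0), (FINISHED, 1600.0)]
print(elapsed_seconds(running, now=1300.0))   # 300.0
print(elapsed_seconds(finished, now=5000.0))  # 600.0
```

A running job keeps growing with now, while a finished job is frozen at end-time - start-time, which is the behaviour I want from the live timer.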

Is Prometheus even the right tool for this?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow