Error when using group_left in Prometheus
I get an error when I try to use group_left to join two queries.
The query is:
floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m]) * on (instance) group_left(node) max by (node) (kube_node_labels{label_grid="true"}))
And it shows this error:
Error executing query: found duplicate series for the match group {} on the right hand-side of the operation: [{node="gpu-m-08"}, {node="gpu-l-03"}];many-to-many matching not allowed: matching labels must be unique on one side
Query one output floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m])):
{app="prometheus-node-exporter",chart="prometheus-node-exporter-1.3.0",cluster_name="researchers",gpu="0",heritage="Tiller",instance="172.21.4.101:9100",job="kubernetes-service-endpoints",kubernetes_name="prometheus-node-exporter",kubernetes_namespace="monitoring",release="prometheus-node-exporter",uuid="GPU-92e6ebf6-2b2d-c041-7f70-e16812c0ffa0"}
Query two output max by (node) (kube_node_labels{label_grid="true"}):
{node="gpu-m-08"}
{node="gpu-m-09"}
{node="gpu-m-12"}
I just want to see the node label in the output of the failing query.
BTW, this works (without the label_grid="true" label):
floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m]) * on (instance) group_left(nodename) node_uname_info)
It adds nodename to the list of labels in the query output.
The main goal is just to see the metrics for nodes that have the label label_grid="true", together with their node name.
Solution 1:[1]
After max by (node), the RHS has no instance label, so on (instance) puts every RHS series into the same empty match group {} and tries to match all of them to each series on the LHS. Try max by (node, instance) (kube_node_labels{label_grid="true"})
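Applied to the query from the question, that suggestion would look roughly like the sketch below. It assumes kube_node_labels carries an instance label whose values are identical to those on dcgm_gpu_utilization; if the two metrics are scraped from different targets, the join matches nothing (see Solution 2).

```promql
# Keep instance on the right-hand side so each on (instance) group
# contains at most one series; node is copied over by group_left(node).
floor(
  avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m])
  * on (instance) group_left(node)
  max by (node, instance) (kube_node_labels{label_grid="true"})
)
```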
Solution 2:[2]
The group_left() modifier expects the right-hand side of the * operator (or any other binary operator) to contain only a single time series for each label=value set specified inside the on() modifier. Otherwise the query fails with the duplicate series for the match group error. See these docs for more details.
The solution is to specify the proper labels inside the on() modifier, so that every label=value set for these labels has only a single time series on the right-hand side of the * operator. The instance label is a good candidate to put inside the on() modifier. The only issue is that dcgm_gpu_utilization and kube_node_labels are collected from different targets with different TCP port numbers, so they have different instance label values (see these docs explaining how the instance label is generated). This breaks the matching rules for the * operator, so the following query returns nothing:
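One way to check this condition before joining is to count the right-hand-side series per on() label. The two queries below are a minimal sketch that reuses the selectors from the question; run each one separately.

```promql
# Right-hand side of the failing query: max by (node) drops instance,
# so all series fall into the single empty group {} and the count
# equals the number of nodes.
count by (instance) (max by (node) (kube_node_labels{label_grid="true"})) > 1

# Raw metric: lists every instance value shared by more than one series.
count by (instance) (kube_node_labels{label_grid="true"}) > 1
```

An empty result means the right-hand side is unique for the chosen on() labels, which is what group_left() requires.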
floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m]))
* on (instance) group_left(node)
kube_node_labels{label_grid="true"}
This can be fixed by stripping the port number from the instance label on both sides of the * operator with the help of the label_replace function:
label_replace(
  floor(avg_over_time(dcgm_gpu_utilization{cluster_name="researchers"}[5m])),
  "hostname",
  "$1",
  "instance",
  "([^:]+):.+"
)
* on (hostname) group_left(node)
label_replace(
  kube_node_labels{label_grid="true"},
  "hostname",
  "$1",
  "instance",
  "([^:]+):.+"
)
This query extracts the hostname part from the instance labels, puts it into a hostname label, and then joins the left-hand side and right-hand side time series on this label. For example, instance="172.21.4.101:9100" yields hostname="172.21.4.101".
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | brian-brazil |
| Solution 2 | valyala |
