Prometheus Alertmanager includes resolved alerts in a new alert group

Question: Why does a group of resolved alerts X appear in a group of later alerts Y?

Background
We have an alert called "InstanceDown" with the expression up{job!=""} == 0. We have two testing environments, DEV and TEST.
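
For reference, a minimal sketch of what such a rule could look like in a Prometheus rule file (the group name, the for: duration, the severity label and the annotation are assumptions added for illustration, not taken from the question):

groups:
  - name: instance-health          # assumed group name
    rules:
      - alert: InstanceDown
        expr: 'up{job!=""} == 0'
        for: 1m                    # assumed; fire only after the target has been down for one evaluation
        labels:
          severity: critical       # assumed label
        annotations:
          summary: 'Instance {{ $labels.instance }} of job {{ $labels.job }} is down'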

When deploying to DEV the alerts fire correctly and soon we get the [RESOLVED] messages. So far so good. But when deploying to TEST we get alerts for that environment AND the previous DEV alerts again - even though they were resolved some minutes earlier.

I suspect there's something with groups at play here: the group_by, group_wait, group_interval and repeat_interval settings. As seen below, I group by alertname. Could it be that the previous group of DEV alarms still remains, and that the TEST alarms are added to it when they come into play?

I've read plenty trying to understand these settings, and while some articles gave insights, none laid them out plainly for me. Below is my setup (including comments where I try to explain to myself how things work - if you feel like reading those as well, please correct me on any misses):

Prometheus

scrape_interval:     1m
evaluation_interval: 1m
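
For context, a minimal sketch of where these two settings sit in prometheus.yml; the rule_files and alerting entries are assumptions added only to show the surrounding structure:

global:
  scrape_interval:     1m                  # how often targets are scraped, i.e. how fresh the up{} samples are
  evaluation_interval: 1m                  # how often alerting rules such as InstanceDown are evaluated

rule_files:
  - alerts.yml                             # assumed file name for the InstanceDown rule

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']   # assumed Alertmanager address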

Alertmanager

  group_by: ['alertname']  # Alerts with the same values for these labels end up in one group, i.e., they go together in a single Slack notification (less spammy).
  group_wait: 30s          # How long to let a group build up before sending it to Slack FOR THE FIRST TIME.
  group_interval: 5m       # The group now lives on in memory, sleeping, but checking in on things after each group_interval. If a new alert was added during an
                           # interval, an updated Slack notification is sent on these occasions.
                           #
                           #    CASES: 
                           #        a) NEW ALERT CAME IN DURING THE GROUP_INTERVAL TIME: 
                           #               - Group wakes up 
                           #               - New alert is added to group 
                           #               - Updated notification sent to Slack, including all its alerts (the old ones and the new one).
                           #               - Group goes back to sleep for another group_interval.
                           #
                           #        b) NO NEW ALERTS:
                           #               - Group wakes up
                           #               - Group goes back to sleep for another group_interval.
                           #               
                           #        c) NO NEW ALERTS UP UNTIL THE REPEAT_INTERVAL:
                           #               - Group wakes up
                           #               - Group received no new alerts for a few successive group_intervals, and the repeat_interval
                           #                 timer has now elapsed.
                           #               - Group now REPEATS its latest Slack notification (NOTE: not updating it as earlier).
                           #               - NOTE: the repeat_interval is counted from the last Slack notification sent, not from the
                           #                 end of the group_interval that just elapsed.
                           #
                           #    ILLUSTRATION:
                           #         ________________________________ __________________________________________________________________________________________________
                           #        <<--      repeat interval       ||                                       repeat_interval                                           |
                           #         ________________________________ ________________________________ ________________________________ ________________________________
                           #        |        group_interval         ||        group_interval         ||        group_interval         ||        group_interval         |
                           #      [N1]--------------[A]-----------[N2]------------------------------------------------------------------------------------------------[N2]       
                           #       ^                 ^             ^                                 ^                                ^                                ^
                           #     first            new alert   updated notification               no new alerts,                   no new alerts,                   no new alerts,
                           # notification                        due to [A]                   no new notification              no new notification            repeat_interval elapsed
                           #                                                                                                                                        repeats N2
                           #                  
                           
  repeat_interval: 1h      # How long to wait before REPEATING a Slack notification that has already been sent.
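
Put together, a minimal sketch of how these settings typically sit inside alertmanager.yml under a route with a Slack receiver; the receiver name, channel and api_url are placeholders, not taken from the question:

route:
  receiver: 'slack-notifications'
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'   # placeholder webhook URL
        channel: '#alerts'                                 # placeholder channel
        send_resolved: true                                # sends the [RESOLVED] messages mentioned above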


Solution 1:[1]

As seen below I group by the alertname. Could it be that the previous group of DEV alarms still remains, and that the TEST alarms are added to it when they come into play?

Your suspicion is correct. According to the official docs:

# The labels by which incoming alerts are grouped together. For example,
# multiple alerts coming in for cluster=A and alertname=LatencyHigh would
# be batched into a single group.
#
# To aggregate by all possible labels use the special value '...' as the sole label name, for example:
# group_by: ['...']
# This effectively disables aggregation entirely, passing through all
# alerts as-is. This is unlikely to be what you want, unless you have
# a very low alert volume or your upstream notification system performs
# its own grouping.
[ group_by: '[' <labelname>, ... ']' ]

Adding the alert's environment label to the group_by config could solve this problem.

group_by: ['alertname', 'environment']
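
A sketch of how this could look in context. It assumes the DEV/TEST distinction is exposed as an environment label on the alerts; if it is not already, one common way to add it (an assumption about the setup, not something stated in the question) is external_labels in each environment's Prometheus config:

# alertmanager.yml - relevant part of the route
route:
  receiver: 'slack-notifications'            # placeholder receiver name
  group_by: ['alertname', 'environment']     # DEV and TEST alerts now land in separate groups
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h

# prometheus.yml in the DEV environment (the TEST instance would set environment: TEST)
global:
  external_labels:
    environment: DEV

With this grouping, the resolved DEV alerts and the new TEST alerts no longer share a group, so a deployment to TEST no longer re-sends the earlier, already-resolved DEV notifications.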

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 YwH