Prometheus alert with a comparison binary operator not firing?

I'm trying to write a simplified example of an alert which fires after the Kafka consumer group lag metric exposed by the Kafka Exporter exceeds a certain value. With the following directory structure,

.
├── README.md
├── docker-compose.yml
├── kafka-exporter
│   ├── Dockerfile
│   └── run.sh
└── prometheus
    ├── alerts.rules.yml
    └── prometheus.yml

where the docker-compose.yml reads

version: '2'

networks:
  app-tier:
    driver: bridge

services:
  zookeeper:
    image: 'bitnami/zookeeper:latest'
    environment:
      - 'ALLOW_ANONYMOUS_LOGIN=yes'
    networks:
      - app-tier
  kafka:
    image: 'bitnami/kafka:latest'
    environment:
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
      - ALLOW_PLAINTEXT_LISTENER=yes
    networks:
      - app-tier
  kafka-exporter:
    build: kafka-exporter
    ports:
      - "9308:9308"
    networks:
      - app-tier
    entrypoint: ["run.sh"]
  prometheus:
    image: bitnami/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - "./prometheus/prometheus.yml:/opt/bitnami/prometheus/conf/prometheus.yml"
      - "./prometheus/alerts.rules.yml:/alerts.rules.yml"
    networks:
      - app-tier
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    networks:
      - app-tier

the run.sh is a wrapper script to wait for Kafka to be ready,

#!/bin/sh
while ! bin/kafka_exporter --verbosity 2; do
    echo "Waiting for the Kafka cluster to come up..."
    sleep 1
done

and the Prometheus configuration files are prometheus.yml,

global:
  scrape_interval: 10s
  scrape_timeout: 10s
  evaluation_interval: 1m

scrape_configs:
  - job_name: kafka-exporter
    metrics_path: /metrics
    honor_labels: false
    honor_timestamps: true
    sample_limit: 0
    static_configs:
      - targets: ['kafka-exporter:9308']

rule_files:
  - "/alerts.rules.yml"

and alerts.rules.yml,

groups:
  - name: alerts
    rules:
      - alert: excessive_consumer_group_lag
        expr: kafka_consumergroup_lag{topic="example"} > 10
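
For reference, my understanding of the rule expression is that a comparison operator between an instant vector and a scalar acts as a filter: series that fail the comparison are dropped, and the alert becomes active for every series that remains. The following is a minimal Python sketch of that behaviour, not Prometheus code. The function name `filter_greater_than` and the sample label sets are hypothetical and only meant to illustrate the semantics.

```python
# Illustrative model of PromQL `vector{matchers} > scalar` filtering.
# This is NOT Prometheus code; names and sample data are hypothetical.
from typing import Dict, List, Tuple

Sample = Tuple[Dict[str, str], float]  # (label set, value)


def filter_greater_than(samples: List[Sample], threshold: float,
                        matchers: Dict[str, str]) -> List[Sample]:
    """Keep samples whose labels equal every matcher and whose value exceeds threshold."""
    result = []
    for labels, value in samples:
        labels_match = all(labels.get(k) == v for k, v in matchers.items())
        if labels_match and value > threshold:
            result.append((labels, value))
    return result


# Hypothetical samples shaped like kafka_consumergroup_lag series.
samples = [
    ({"topic": "example", "consumergroup": "my-consumer-group", "partition": "0"}, 12.0),
    ({"topic": "other", "consumergroup": "my-consumer-group", "partition": "0"}, 50.0),
    ({"topic": "example", "consumergroup": "my-consumer-group", "partition": "1"}, 3.0),
]

# Models: kafka_consumergroup_lag{topic="example"} > 10
active = filter_greater_than(samples, 10, {"topic": "example"})
print(len(active))
```

With these made-up samples only the first series survives the filter, so the script prints 1. An empty result would correspond to an inactive alert.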

One thing I've omitted here is an example app which consumes from the example topic using a consumer group named my-consumer-group. I manually stop that app and then produce messages to the topic using the Kafka console producer:

> docker run -it --network kafka-exporter-example_app-tier bitnami/kafka:latest kafka-console-producer.sh --topic example --bootstrap-server kafka:9092
kafka 18:42:41.24 
kafka 18:42:41.24 Welcome to the Bitnami kafka container
kafka 18:42:41.24 Subscribe to project updates by watching https://github.com/bitnami/bitnami-docker-kafka
kafka 18:42:41.25 Submit issues and feature requests at https://github.com/bitnami/bitnami-docker-kafka/issues
kafka 18:42:41.25 

>Message 1
>Message 2
...

After producing more than 10 messages this way, I can see the corresponding metric increase in Grafana:

[image: Grafana graph showing the metric increasing]

However, in the Prometheus UI, the corresponding alert is neither pending nor firing:

[image: Prometheus UI showing the alert as neither pending nor firing]

I'm struggling to see why the alert is not firing. The expression seems similar to the one given in the example in https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/.
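
One way to narrow this down would be to ask Prometheus directly whether the rule file was loaded and what the rule expression currently returns, using its HTTP API (`/api/v1/rules` and `/api/v1/query`). The following is a rough Python sketch of such a check, not a definitive diagnosis. The helper names (`find_rule`, `check`, and so on) are my own, and it assumes Prometheus is reachable at http://localhost:9090, as per the port mapping in the docker-compose.yml above.

```python
# Diagnostic sketch using the Prometheus HTTP API; helper names are hypothetical.
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional


def rules_url(base_url: str) -> str:
    """URL of the endpoint that lists the loaded rule groups."""
    return base_url.rstrip("/") + "/api/v1/rules"


def query_url(base_url: str, expr: str) -> str:
    """URL of an instant query for the given PromQL expression."""
    return base_url.rstrip("/") + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def find_rule(payload: Dict[str, Any], alert_name: str) -> Optional[Dict[str, Any]]:
    """Return the rule entry with the given name from a /api/v1/rules response, if any."""
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("name") == alert_name:
                return rule
    return None


def fetch_json(url: str, timeout: float = 5.0) -> Dict[str, Any]:
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def check(base_url: str, alert_name: str, expr: str) -> None:
    """Print whether the rule is loaded and how many series the expression returns."""
    rule = find_rule(fetch_json(rules_url(base_url)), alert_name)
    print("rule loaded:", rule is not None)
    if rule is not None:
        print("state:", rule.get("state"), "health:", rule.get("health"))
    result = fetch_json(query_url(base_url, expr))["data"]["result"]
    print("series returned by expression:", len(result))
```

Calling `check("http://localhost:9090", "excessive_consumer_group_lag", 'kafka_consumergroup_lag{topic="example"} > 10')` while the stack is running should show whether the rule is missing altogether, or loaded but evaluating to an empty result.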



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
