Apache Kafka exposes hundreds of internal metrics through Java Management Extensions (JMX), but those numbers are useless without a proper collection and visualization pipeline. Prometheus paired with Grafana gives you real-time dashboards, historical trend analysis, and alerting for every critical Kafka metric – from under-replicated partitions to consumer lag.

This guide walks through a full monitoring stack for Apache Kafka on Ubuntu 24.04 and Rocky Linux 10. We cover JMX Exporter for broker and ZooKeeper metrics, kafka_exporter for topic and consumer group data, Prometheus scrape configuration, Grafana dashboards, alerting rules, and Burrow for advanced consumer lag tracking.

Prerequisites

  • A running Apache Kafka cluster (1 or more brokers) with ZooKeeper or KRaft – see Install Apache Kafka on Ubuntu if you need a cluster set up
  • A Prometheus server (version 2.x or later) – refer to Install Prometheus on Ubuntu / Debian for setup instructions
  • A Grafana instance (version 10+) – see Install Grafana on Debian / Ubuntu
  • Servers running Ubuntu 24.04 LTS or Rocky Linux 10
  • Root or sudo access on all nodes
  • Java 17 or later installed on Kafka brokers
  • Network connectivity between Prometheus and Kafka nodes on ports 7071, 7072, and 9308 (plus 9188 if you add the Burrow exporter in Step 7)

Key Kafka Metrics to Monitor

Before wiring up exporters, it helps to know which metrics actually matter in production. The table below lists the critical ones grouped by category.

| Metric | What It Tells You | Alert Threshold |
|---|---|---|
| UnderReplicatedPartitions | Partitions where replicas have fallen behind the leader. Non-zero means data is at risk. | > 0 for more than 5 minutes |
| IsrShrinksPerSec / IsrExpandsPerSec | In-sync replica set changes. Frequent shrinks indicate broker or disk issues. | Shrinks without matching expands |
| ActiveControllerCount | Number of active controllers in the cluster. Must be exactly 1. | != 1 |
| OfflinePartitionsCount | Partitions with no active leader. Zero tolerance – any offline partition means data is unavailable. | > 0 |
| RequestHandlerAvgIdlePercent | How busy the broker request handler threads are. Below 20% idle means the broker is saturated. | < 0.2 |
| TotalProduceRequestsPerSec | Produce request throughput per broker. | Sudden drops or spikes |
| TotalFetchRequestsPerSec | Fetch request throughput (consumer and follower). | Sudden drops or spikes |
| LogSizeBytes (per topic/partition) | Disk usage per partition. Tracks growth rate for capacity planning. | Custom per environment |
| ConsumerLag | Number of messages a consumer group is behind the latest offset. High lag means consumers cannot keep up. | > 10000 (varies by workload) |
| RequestLatency (Produce/Fetch 99th) | End-to-end latency for produce and fetch requests at the 99th percentile. | > 500ms |

Step 1: Install JMX Exporter on Kafka Brokers

The Prometheus JMX Exporter runs as a Java agent attached to the Kafka broker JVM process. It reads the broker's MBeans and exposes them on an HTTP endpoint in Prometheus format. Download the JMX Exporter Java agent jar (version 1.0.1 in this guide) from the Maven Central repository.

Run these commands on each Kafka broker node.

$ sudo mkdir -p /opt/jmx-exporter
$ cd /opt/jmx-exporter
$ sudo wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/jmx_prometheus_javaagent-1.0.1.jar

Verify the download completed successfully.

$ ls -lh /opt/jmx-exporter/
total 520K
-rw-r--r-- 1 root root 518K Mar 18 10:22 jmx_prometheus_javaagent-1.0.1.jar

Create JMX Exporter Configuration for Kafka

The JMX Exporter needs a YAML configuration file that defines which MBeans to collect and how to map them to Prometheus metric names. Create the Kafka broker configuration file.

$ sudo vim /opt/jmx-exporter/kafka-broker.yml

Add the following content. This configuration captures all critical broker metrics including request rates, partition state, replication health, and request latency percentiles.

lowercaseOutputName: true
lowercaseOutputLabelNames: true

# The MBean selectors below follow Kafka's documented object names.
# Verify them against your broker version before relying on the output.
rules:
# Broker topic metrics - message rates, byte rates
- pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)PerSec, topic=(.+)><>Count
  name: kafka_server_brokertopicmetrics_$1_total
  type: COUNTER
  labels:
    topic: "$2"

- pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)PerSec><>Count
  name: kafka_server_brokertopicmetrics_$1_total
  type: COUNTER

# Request metrics - latency percentiles
- pattern: kafka.network<type=RequestMetrics, name=(.+), request=(.+)><>(\d+)thPercentile
  name: kafka_network_requestmetrics_$1
  type: GAUGE
  labels:
    request: "$2"
    quantile: "0.$3"

- pattern: kafka.network<type=RequestMetrics, name=(.+)PerSec, request=(.+)><>Count
  name: kafka_network_requestmetrics_$1_total
  type: COUNTER
  labels:
    request: "$2"

# Replica manager - ISR shrink/expand, under-replicated partitions
- pattern: kafka.server<type=ReplicaManager, name=(.+)><>Value
  name: kafka_server_replicamanager_$1
  type: GAUGE

- pattern: kafka.server<type=ReplicaManager, name=(.+)PerSec><>Count
  name: kafka_server_replicamanager_$1_total
  type: COUNTER

# Controller metrics
- pattern: kafka.controller<type=KafkaController, name=(.+)><>Value
  name: kafka_controller_kafkacontroller_$1
  type: GAUGE

# Log metrics - size per topic/partition
- pattern: kafka.log<type=Log, name=Size, topic=(.+), partition=(.+)><>Value
  name: kafka_log_size_bytes
  type: GAUGE
  labels:
    topic: "$1"
    partition: "$2"

# Request handler pool utilization
- pattern: kafka.server<type=KafkaRequestHandlerPool, name=RequestHandlerAvgIdlePercent><>OneMinuteRate
  name: kafka_server_requesthandleravgidlepercent
  type: GAUGE

# Purgatory metrics
- pattern: kafka.server<type=DelayedOperationPurgatory, name=PurgatorySize, delayedOperation=(.+)><>Value
  name: kafka_server_delayedoperationpurgatory_purgatorysize
  type: GAUGE
  labels:
    delayedOperation: "$1"

# Generic per-second counters with labels
- pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+), (.+)=(.+)><>Count
  name: kafka_$1_$2_$3_total
  type: COUNTER
  labels:
    "$4": "$5"
    "$6": "$7"

- pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*, (.+)=(.+)><>Count
  name: kafka_$1_$2_$3_total
  type: COUNTER
  labels:
    "$4": "$5"

- pattern: kafka.(\w+)<type=(.+), name=(.+)PerSec\w*><>Count
  name: kafka_$1_$2_$3_total
  type: COUNTER

# Generic gauges
- pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+), (.+)=(.+)><>Value
  name: kafka_$1_$2_$3
  type: GAUGE
  labels:
    "$4": "$5"
    "$6": "$7"

- pattern: kafka.(\w+)<type=(.+), name=(.+), (.+)=(.+)><>Value
  name: kafka_$1_$2_$3
  type: GAUGE
  labels:
    "$4": "$5"

- pattern: kafka.(\w+)<type=(.+), name=(.+)><>Value
  name: kafka_$1_$2_$3
  type: GAUGE
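
To see how a rule turns an MBean into a metric name, the following Python sketch mimics the matching step. It assumes the exporter presents each attribute as `domain<key=value, ...><>attribute`, matches the pattern without anchoring, and substitutes capture groups into the name. The helper `apply_rule` is hypothetical and is not part of JMX Exporter; the sample bean and pattern are illustrative only.

```python
import re

def apply_rule(pattern, name_template, bean, lowercase=True):
    """Return the metric name a rule would yield for a bean string, or None."""
    match = re.search(pattern, bean)  # unanchored match, as an assumption
    if match is None:
        return None
    name = name_template
    # Substitute from the highest group down so "$1" never clobbers "$10".
    for i in range(len(match.groups()), 0, -1):
        name = name.replace(f"${i}", match.group(i))
    # lowercaseOutputName: true corresponds to lowercasing the final name.
    return name.lower() if lowercase else name

bean = "kafka.server<type=ReplicaManager, name=UnderReplicatedPartitions><>Value"
pattern = r"kafka.server<type=ReplicaManager, name=(.+)><>Value"
print(apply_rule(pattern, "kafka_server_replicamanager_$1", bean))
# kafka_server_replicamanager_underreplicatedpartitions
```

Running a candidate pattern through a helper like this before restarting a broker is a cheap way to catch a selector that matches nothing.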

Attach JMX Exporter to Kafka Broker

The JMX Exporter must be loaded as a Java agent when the Kafka broker starts. The recommended way is to set the KAFKA_OPTS environment variable in the broker’s systemd unit file. Open the Kafka service file.

$ sudo vim /etc/systemd/system/kafka.service

Add (or update) the KAFKA_OPTS environment variable under the [Service] section. Port 7071 is the HTTP endpoint where the exporter will serve metrics.

[Service]
Type=simple
User=kafka
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="KAFKA_OPTS=-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent-1.0.1.jar=7071:/opt/jmx-exporter/kafka-broker.yml"
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal

On Rocky Linux 10, the Java home path differs slightly.

Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk"

Reload systemd and restart Kafka.

$ sudo systemctl daemon-reload
$ sudo systemctl restart kafka

Verify the JMX exporter is listening on port 7071.

$ sudo ss -tlnp | grep 7071
LISTEN  0  3  *:7071  *:*  users:(("java",pid=4521,fd=98))

Test that metrics are being served.

$ curl -s http://localhost:7071/metrics | head -20
# HELP jvm_info VM version info
# TYPE jvm_info gauge
jvm_info{runtime="OpenJDK Runtime Environment",vendor="Ubuntu",version="17.0.13+11-Ubuntu-2ubuntu1"} 1.0
# HELP kafka_server_replicamanager_underreplicatedpartitions kafka.server ReplicaManager UnderReplicatedPartitions
# TYPE kafka_server_replicamanager_underreplicatedpartitions gauge
kafka_server_replicamanager_underreplicatedpartitions 0.0
...
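
The same check can be scripted. The sketch below parses Prometheus text-format output and reports which expected metric names are absent. The function names and the sample text are illustrative; in practice the text would come from the exporter endpoint queried above.

```python
def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The name ends at the first "{" (labels) or space (value).
        end = len(line)
        for sep in ("{", " "):
            pos = line.find(sep)
            if pos != -1:
                end = min(end, pos)
        names.add(line[:end])
    return names

def missing_metrics(exposition_text, required):
    """Return the required metric names that do not appear in the output."""
    return sorted(set(required) - metric_names(exposition_text))

sample = """# TYPE jvm_info gauge
jvm_info{runtime="OpenJDK Runtime Environment"} 1.0
kafka_server_replicamanager_underreplicatedpartitions 0.0
"""
required = [
    "kafka_server_replicamanager_underreplicatedpartitions",
    "kafka_controller_kafkacontroller_activecontrollercount",
]
print(missing_metrics(sample, required))
# ['kafka_controller_kafkacontroller_activecontrollercount']
```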

Open the firewall port if Prometheus runs on a separate server.

# Ubuntu 24.04
$ sudo ufw allow 7071/tcp

# Rocky Linux 10
$ sudo firewall-cmd --permanent --add-port=7071/tcp
$ sudo firewall-cmd --reload

[Screenshot: JMX Exporter metrics for a Kafka broker displayed in a browser]

Step 2: Install JMX Exporter on ZooKeeper

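ZooKeeper health directly affects Kafka cluster stability. If you are running Kafka in ZooKeeper mode (not KRaft), attach the JMX Exporter to ZooKeeper as well. Download the same jar on each ZooKeeper node; if ZooKeeper shares a host with a broker, the jar from Step 1 is already in place and you can skip this.

$ sudo mkdir -p /opt/jmx-exporter
$ cd /opt/jmx-exporter
$ sudo wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/1.0.1/jmx_prometheus_javaagent-1.0.1.jar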

Create a ZooKeeper-specific JMX configuration file.

$ sudo vim /opt/jmx-exporter/zookeeper.yml

Add the following rules to capture ZooKeeper session counts, latency, and connection metrics.

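lowercaseOutputName: true
lowercaseOutputLabelNames: true

# Selectors below target a replicated ensemble (ReplicatedServer_id...).
# Verify the MBean names against your ZooKeeper version.
rules:
- pattern: org.apache.ZooKeeperService<name0=ReplicatedServer_id(\d+), name1=replica.(\d+), name2=(\w+)><>(\w+)
  name: zookeeper_$4
  type: GAUGE
  labels:
    replicaId: "$2"

- pattern: org.apache.ZooKeeperService<name0=ReplicatedServer_id(\d+), name1=replica.(\d+)><>(\w+)
  name: zookeeper_$3
  type: GAUGE
  labels:
    replicaId: "$2"

- pattern: org.apache.ZooKeeperService<name0=ReplicatedServer_id(\d+)><>(\w+)
  name: zookeeper_$2
  type: GAUGE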

Add the Java agent to the ZooKeeper systemd service. Open the ZooKeeper unit file.

$ sudo vim /etc/systemd/system/zookeeper.service

Add the KAFKA_OPTS environment variable; the zookeeper-server-start.sh script bundled with Kafka passes it to the JVM. The ZooKeeper exporter listens on port 7072 to avoid conflicts with the broker exporter on 7071.

[Service]
Type=simple
User=kafka
Environment="JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64"
Environment="KAFKA_OPTS=-javaagent:/opt/jmx-exporter/jmx_prometheus_javaagent-1.0.1.jar=7072:/opt/jmx-exporter/zookeeper.yml"
ExecStart=/opt/kafka/bin/zookeeper-server-start.sh /opt/kafka/config/zookeeper.properties
ExecStop=/opt/kafka/bin/zookeeper-server-stop.sh
Restart=on-abnormal

Restart ZooKeeper and verify the exporter is running.

$ sudo systemctl daemon-reload
$ sudo systemctl restart zookeeper
$ curl -s http://localhost:7072/metrics | grep zookeeper_ | head -5
zookeeper_avgrequestlatency 0.0
zookeeper_maxrequestlatency 12.0
zookeeper_numaliveconnections 2.0
zookeeper_outstandingrequests 0.0
zookeeper_packetsreceived 145.0

Step 3: Install kafka_exporter for Topic and Consumer Metrics

The JMX Exporter gives you broker-level metrics, but it does not expose per-topic consumer lag or consumer group offsets directly. For that, use kafka_exporter by danielqsj, which connects to Kafka as a client and exports topic, partition, and consumer group metrics.

Download and install kafka_exporter on any server that has network access to the Kafka cluster.

$ cd /tmp
$ wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.8.0/kafka_exporter-1.8.0.linux-amd64.tar.gz
$ tar xzf kafka_exporter-1.8.0.linux-amd64.tar.gz
$ sudo mv kafka_exporter-1.8.0.linux-amd64/kafka_exporter /usr/local/bin/
$ kafka_exporter --version
kafka_exporter 1.8.0

Create a systemd service for kafka_exporter.

$ sudo vim /etc/systemd/system/kafka-exporter.service

Add the following content. Replace the broker addresses with your actual Kafka broker IPs or hostnames.

[Unit]
Description=Kafka Exporter
After=network.target

[Service]
Type=simple
User=nobody
ExecStart=/usr/local/bin/kafka_exporter \
  --kafka.server=10.0.1.10:9092 \
  --kafka.server=10.0.1.11:9092 \
  --kafka.server=10.0.1.12:9092 \
  --topic.filter=".*" \
  --group.filter=".*" \
  --web.listen-address=:9308
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Enable and start the service.

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now kafka-exporter

Verify kafka_exporter is running and returning metrics.

$ sudo systemctl status kafka-exporter
  Active: active (running) since Thu 2026-03-19 10:30:15 UTC

$ curl -s http://localhost:9308/metrics | grep kafka_consumergroup_lag | head -5
kafka_consumergroup_lag{consumergroup="my-group",partition="0",topic="orders"} 42
kafka_consumergroup_lag{consumergroup="my-group",partition="1",topic="orders"} 18
kafka_consumergroup_lag{consumergroup="my-group",partition="2",topic="orders"} 7

The key metrics exposed by kafka_exporter include:

  • kafka_consumergroup_lag – consumer lag per partition per consumer group
  • kafka_consumergroup_current_offset – current offset of each consumer group
  • kafka_topic_partition_current_offset – latest offset (log end) per partition
  • kafka_topic_partitions – number of partitions per topic
  • kafka_topic_partition_replicas – replica count per partition
  • kafka_topic_partition_in_sync_replica – in-sync replica count per partition
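
Conceptually, the lag metric is the gap between a partition's latest offset and the offset the group has committed. The sketch below shows that arithmetic, assuming offsets are already available as plain dictionaries; `group_lag` and the sample offsets are hypothetical and chosen to reproduce the lag values 42, 18, and 7 shown above.

```python
from collections import defaultdict

def group_lag(log_end_offsets, committed_offsets):
    """Sum per-partition lag (log end offset minus committed offset) per group."""
    totals = defaultdict(int)
    for (group, topic, partition), committed in committed_offsets.items():
        end = log_end_offsets[(topic, partition)]
        # Clamp at zero: a commit can briefly appear ahead of a stale end offset.
        totals[group] += max(end - committed, 0)
    return dict(totals)

log_end = {("orders", 0): 1042, ("orders", 1): 918, ("orders", 2): 507}
committed = {
    ("my-group", "orders", 0): 1000,
    ("my-group", "orders", 1): 900,
    ("my-group", "orders", 2): 500,
}
print(group_lag(log_end, committed))  # {'my-group': 67}
```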

Step 4: Configure Prometheus to Scrape Kafka Metrics

With all three exporters running (JMX on brokers, JMX on ZooKeeper, kafka_exporter), configure Prometheus to scrape them. Open the Prometheus configuration file.

$ sudo vim /etc/prometheus/prometheus.yml

Add the following scrape job configurations under the scrape_configs section. Replace the IP addresses with your actual server addresses.

scrape_configs:
  # Kafka broker JMX metrics
  - job_name: 'kafka-brokers'
    scrape_interval: 15s
    static_configs:
      - targets:
        - '10.0.1.10:7071'
        - '10.0.1.11:7071'
        - '10.0.1.12:7071'
        labels:
          cluster: 'production'

  # ZooKeeper JMX metrics
  - job_name: 'zookeeper'
    scrape_interval: 15s
    static_configs:
      - targets:
        - '10.0.1.10:7072'
        - '10.0.1.11:7072'
        - '10.0.1.12:7072'

  # kafka_exporter - topic and consumer group metrics
  - job_name: 'kafka-exporter'
    scrape_interval: 30s
    static_configs:
      - targets:
        - '10.0.1.10:9308'

Validate the configuration and restart Prometheus.

$ promtool check config /etc/prometheus/prometheus.yml
Checking /etc/prometheus/prometheus.yml
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file

$ sudo systemctl restart prometheus

Open the Prometheus web UI at http://prometheus-server:9090 and go to Status > Targets. All three job groups should show as UP.

[Screenshot: Prometheus targets page with Kafka brokers showing UP status]

Test a quick query in the Prometheus expression browser to confirm metrics are flowing.

kafka_server_replicamanager_underreplicatedpartitions

This should return a value (ideally 0) for each broker.

Step 5: Set Up Grafana Dashboards for Kafka Monitoring

Log into your Grafana instance and add Prometheus as a data source if you have not already done so. Navigate to Connections > Data Sources > Add data source, select Prometheus, and enter your Prometheus server URL (for example, http://10.0.1.20:9090).

[Screenshots: adding a data source in Grafana, selecting Prometheus as the type, and entering the Prometheus URL]

Click Save and Test. A green success message confirms Grafana can reach Prometheus.

[Screenshot: successful Prometheus data source test in Grafana]

Import Pre-built Kafka Dashboards

Grafana has several community dashboards built specifically for Kafka monitoring. Go to Dashboards > New > Import and use the following dashboard IDs.

| Dashboard ID | Name | What It Shows |
|---|---|---|
| 7589 | Kafka Exporter Overview | Consumer lag, topic offsets, partition counts from kafka_exporter |
| 721 | JMX Overview | JVM heap, GC, threads – useful for broker JVM health |
| 11962 | Kafka Cluster Overview | Broker metrics, request rates, ISR changes, under-replicated partitions |

For each dashboard, enter the ID in the Import field and click Load.

[Screenshot: Grafana import dashboard page with the dashboard ID field]

Select the Prometheus data source from the dropdown and click Import.

[Screenshot: Grafana import settings with the Prometheus data source selected]

The dashboard loads immediately with live data from your Kafka cluster.

[Screenshot: Grafana dashboard showing Kafka broker metrics and graphs]

Essential Custom Panels

Beyond the pre-built dashboards, add these custom panels to a dedicated Kafka Operations dashboard. Each panel uses a PromQL query that targets the metrics we configured earlier.

Under-replicated partitions across all brokers (should always be 0):

sum(kafka_server_replicamanager_underreplicatedpartitions) by (instance)

Active controller count (must be exactly 1):

sum(kafka_controller_kafkacontroller_activecontrollercount)

Produce request rate per broker:

rate(kafka_server_brokertopicmetrics_totalproducerequests_total[5m])

Fetch request latency at the 99th percentile (in milliseconds, from the TotalTimeMs MBean):

kafka_network_requestmetrics_totaltimems{request="Fetch",quantile="0.99"}

Total consumer lag per consumer group:

sum(kafka_consumergroup_lag) by (consumergroup)

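Log size growth rate per topic (bytes per second). Log size is a gauge that shrinks when retention deletes segments, so use deriv() rather than the counter-oriented rate():

sum(deriv(kafka_log_size_bytes[10m])) by (topic)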
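
For a gauge such as log size, growth per second can be estimated by fitting a straight line through the samples in the window, which is the approach PromQL's deriv() function takes. The sketch below shows that least-squares slope on made-up samples; `slope_per_second` is a hypothetical helper, not Prometheus code.

```python
def slope_per_second(samples):
    """Least-squares slope of (timestamp_seconds, value) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Covariance of time and value divided by variance of time gives the slope.
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var

# A partition growing by 600 bytes every 60 seconds.
samples = [(0, 1000), (60, 1600), (120, 2200), (180, 2800)]
print(slope_per_second(samples))  # 10.0
```

Unlike a counter rate, the slope goes negative when the log shrinks, which is the behavior you want for capacity planning.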

Step 6: Configure Alerting Rules for Kafka

Dashboards are great for visual inspection, but alerts catch problems when nobody is watching. Create a Prometheus alerting rules file for Kafka. If you already have Alertmanager configured for email notifications, these rules will fire alerts through your existing notification channels.

$ sudo mkdir -p /etc/prometheus/rules
$ sudo vim /etc/prometheus/rules/kafka-alerts.yml

Add the following alerting rules.

groups:
  - name: kafka-alerts
    rules:
      # Alert when any partition has no leader
      - alert: KafkaOfflinePartitions
        expr: sum(kafka_controller_kafkacontroller_offlinepartitionscount) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka has offline partitions"
          description: "{{ $value }} partitions have no active leader. Data is unavailable for affected topics."

      # Alert when under-replicated partitions exist for more than 5 minutes
      - alert: KafkaUnderReplicatedPartitions
        expr: sum(kafka_server_replicamanager_underreplicatedpartitions) by (instance) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Under-replicated partitions on {{ $labels.instance }}"
          description: "Broker {{ $labels.instance }} has {{ $value }} under-replicated partitions for more than 5 minutes."

      # Alert when there is no active controller or more than one
      - alert: KafkaNoActiveController
        expr: sum(kafka_controller_kafkacontroller_activecontrollercount) != 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Kafka active controller count is {{ $value }}"
          description: "Expected exactly 1 active controller, got {{ $value }}. Cluster may be in a split-brain or leaderless state."

      # Alert when a broker is down (target is unreachable)
      - alert: KafkaBrokerDown
        expr: up{job="kafka-brokers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker {{ $labels.instance }} is down"
          description: "Prometheus cannot reach broker {{ $labels.instance }} for over 2 minutes."

      # Alert when consumer lag exceeds threshold
      - alert: KafkaConsumerLagHigh
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High consumer lag for {{ $labels.consumergroup }} on {{ $labels.topic }}"
          description: "Consumer group {{ $labels.consumergroup }} has {{ $value }} messages of lag on topic {{ $labels.topic }} for over 10 minutes."

      # Alert when consumer lag is growing continuously
      - alert: KafkaConsumerLagGrowing
        expr: avg_over_time(kafka_consumergroup_lag[30m]) - avg_over_time(kafka_consumergroup_lag[30m] offset 30m) > 5000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Consumer lag growing for {{ $labels.consumergroup }}"
          description: "Lag for {{ $labels.consumergroup }} on {{ $labels.topic }} has been increasing for 15+ minutes."

      # Alert when ISR is shrinking frequently
      - alert: KafkaIsrShrinkRate
        expr: rate(kafka_server_replicamanager_isrshrinks_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "ISR shrinking on {{ $labels.instance }}"
          description: "In-sync replica set has been shrinking on broker {{ $labels.instance }} for over 10 minutes. Check disk I/O and network."

      # Alert when request handler threads are saturated
      - alert: KafkaRequestHandlerSaturated
        expr: kafka_server_requesthandleravgidlepercent < 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Request handler threads saturated on {{ $labels.instance }}"
          description: "Broker {{ $labels.instance }} request handler idle percent is {{ $value }}. Broker is overloaded."
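
The `for:` clause in each rule means the expression must hold on every evaluation for that long before the alert moves from pending to firing. The sketch below simulates that state machine over a series of boolean evaluation results; `alert_states` is a hypothetical helper for illustration, not Prometheus code.

```python
def alert_states(condition_samples, interval_s, for_s):
    """Map a series of boolean evaluations to inactive/pending/firing states."""
    states = []
    active_for = None  # seconds the condition has held continuously
    for holds in condition_samples:
        if not holds:
            active_for = None  # any gap resets the timer
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + interval_s
        states.append("firing" if active_for >= for_s else "pending")
    return states

# Evaluated every 60 seconds with "for: 2m".
print(alert_states([True, True, False, True, True, True], 60, 120))
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```

This is why a brief lag spike that clears within the window never pages anyone, while a sustained one does.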

Reference this rules file in the Prometheus configuration.

$ sudo vim /etc/prometheus/prometheus.yml

Add the rules file path under the rule_files section.

rule_files:
  - "/etc/prometheus/rules/kafka-alerts.yml"

Validate and reload Prometheus.

$ promtool check rules /etc/prometheus/rules/kafka-alerts.yml
Checking /etc/prometheus/rules/kafka-alerts.yml
  SUCCESS: 8 rules found

$ sudo systemctl reload prometheus

Verify the alerts are loaded in the Prometheus UI under Alerts. All rules should show as inactive (green) when the cluster is healthy.

Step 7: Set Up Burrow for Advanced Consumer Lag Monitoring

While kafka_exporter provides basic consumer lag numbers, Burrow (developed by LinkedIn) offers a smarter approach. Burrow evaluates consumer lag over a sliding window and assigns statuses such as OK, WARN, ERR, STOP, or STALL based on whether the consumer is making progress, falling behind, or has stalled entirely. This reduces false alerts from bursty workloads where lag spikes briefly but recovers quickly.
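
The idea of a window-based evaluation can be sketched in a few lines of Python. This is a simplified paraphrase of Burrow's documented rules, not its actual implementation: it ignores the time-based stopped-consumer check, and `evaluate_window` is a hypothetical name.

```python
def evaluate_window(window):
    """Classify a partition from (consumer_offset, lag) samples, oldest first."""
    offsets = [offset for offset, _ in window]
    lags = [lag for _, lag in window]
    if any(lag == 0 for lag in lags):
        return "OK"      # the consumer caught up at least once in the window
    lag_never_drops = all(b >= a for a, b in zip(lags, lags[1:]))
    if offsets[0] == offsets[-1] and lag_never_drops:
        return "STALL"   # commits are not advancing while lag persists
    if lag_never_drops:
        return "WARN"    # consuming, but falling further behind
    return "OK"          # lag dropped somewhere in the window

print(evaluate_window([(100, 5), (150, 40), (200, 3)]))   # OK
print(evaluate_window([(100, 5), (100, 9), (100, 14)]))   # STALL
print(evaluate_window([(100, 5), (120, 9), (140, 14)]))   # WARN
```

The first case is the bursty workload: lag jumps to 40 but recovers, so a fixed threshold would have alerted while the window-based check does not.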

Install Burrow from the pre-built binary or build from source with Go.

$ cd /tmp
$ wget https://github.com/linkedin/Burrow/releases/download/v1.6.0/Burrow_1.6.0_linux_amd64.tar.gz
$ tar xzf Burrow_1.6.0_linux_amd64.tar.gz
$ sudo mv Burrow /usr/local/bin/burrow
$ burrow --version

Create the Burrow configuration directory.

$ sudo mkdir -p /etc/burrow

Create the configuration file.

$ sudo vim /etc/burrow/burrow.toml

Add the following configuration. Adjust the Kafka broker and ZooKeeper addresses for your environment.

[general]
pidfile="/var/run/burrow/burrow.pid"
stdout-logfile="/var/log/burrow/burrow.log"

[logging]
level="info"

[zookeeper]
servers=["10.0.1.10:2181","10.0.1.11:2181","10.0.1.12:2181"]
timeout=6
root-path="/burrow"

[client-profile.kafka-profile]
kafka-version="2.0.0"
client-id="burrow-monitor"

[cluster.production]
class-name="kafka"
servers=["10.0.1.10:9092","10.0.1.11:9092","10.0.1.12:9092"]
client-profile="kafka-profile"
topic-refresh=60
offset-refresh=30

[consumer.production]
class-name="kafka"
cluster="production"
servers=["10.0.1.10:9092","10.0.1.11:9092","10.0.1.12:9092"]
client-profile="kafka-profile"
group-denylist="^(console-consumer-|_).*$"
start-latest=true

[httpserver.default]
address=":8000"

[storage.default]
class-name="inmemory"
workers=20
intervals=10
expire-group=604800
min-distance=1

Create a systemd service file for Burrow.

$ sudo vim /etc/systemd/system/burrow.service

Add the following content.

[Unit]
Description=Burrow Kafka Consumer Lag Monitor
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/burrow --config-dir /etc/burrow
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Create the required directories and start the service.

$ sudo mkdir -p /var/run/burrow /var/log/burrow
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now burrow

Verify Burrow is running by querying its HTTP API.

$ curl -s http://localhost:8000/v3/kafka | python3 -m json.tool
{
    "error": false,
    "message": "cluster list returned",
    "clusters": [
        "production"
    ]
}

Check consumer group status through Burrow's evaluation endpoint.

$ curl -s http://localhost:8000/v3/kafka/production/consumer/my-group/lag | python3 -m json.tool
{
    "error": false,
    "message": "consumer status returned",
    "status": {
        "cluster": "production",
        "group": "my-group",
        "status": "OK",
        "maxlag": {
            "topic": "orders",
            "partition": 2,
            "owner": "",
            "status": "OK",
            "start": { "offset": 15230, "timestamp": 1710842100000, "lag": 5 },
            "end": { "offset": 15280, "timestamp": 1710842400000, "lag": 3 }
        },
        "totallag": 8
    }
}
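
If you want to act on this response in a script, the fields of interest are `status.status` and `status.totallag`. The sketch below extracts them from a response shaped like the one above; `summarize_status` is a hypothetical helper and the sample JSON is abbreviated.

```python
import json

def summarize_status(response_text):
    """Return (group, status, totallag) from a Burrow consumer status response."""
    body = json.loads(response_text)
    if body.get("error"):
        raise RuntimeError(body.get("message", "Burrow returned an error"))
    status = body["status"]
    return status["group"], status["status"], status["totallag"]

sample = json.dumps({
    "error": False,
    "message": "consumer status returned",
    "status": {"cluster": "production", "group": "my-group",
               "status": "OK", "totallag": 8},
})
print(summarize_status(sample))  # ('my-group', 'OK', 8)
```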

Expose Burrow Metrics to Prometheus

Burrow's HTTP API returns JSON rather than the Prometheus exposition format. One way to bridge the gap is burrow_exporter, which polls Burrow's API and converts the responses into Prometheus metrics.

$ cd /tmp
$ wget https://github.com/jirwin/burrow_exporter/releases/download/v0.1.0/burrow-exporter-linux-amd64
$ chmod +x burrow-exporter-linux-amd64
$ sudo mv burrow-exporter-linux-amd64 /usr/local/bin/burrow-exporter

Run burrow_exporter as a service. It queries Burrow and re-exposes metrics for Prometheus on port 9188.

$ sudo vim /etc/systemd/system/burrow-exporter.service

Add the service configuration.

[Unit]
Description=Burrow Exporter for Prometheus
After=burrow.service

[Service]
Type=simple
ExecStart=/usr/local/bin/burrow-exporter \
  --burrow-addr=http://localhost:8000 \
  --metrics-addr=:9188 \
  --interval=30
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable and start the exporter.

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now burrow-exporter

Add the burrow-exporter target to Prometheus scrape config.

  - job_name: 'burrow'
    scrape_interval: 30s
    static_configs:
      - targets: ['10.0.1.10:9188']

Reload Prometheus to pick up the new target.

$ sudo systemctl reload prometheus

Step 8: Verify the Complete Monitoring Stack

With all components deployed, run through this checklist to confirm everything is connected and working.

Check all Prometheus targets are UP.

$ curl -s http://localhost:9090/api/v1/targets | python3 -c "
import json,sys
data = json.load(sys.stdin)
for t in data['data']['activeTargets']:
    print(f\"{t['labels']['job']:20s} {t['labels']['instance']:25s} {t['health']}\")"
kafka-brokers         10.0.1.10:7071            up
kafka-brokers         10.0.1.11:7071            up
kafka-brokers         10.0.1.12:7071            up
zookeeper             10.0.1.10:7072            up
zookeeper             10.0.1.11:7072            up
zookeeper             10.0.1.12:7072            up
kafka-exporter        10.0.1.10:9308            up
burrow                10.0.1.10:9188            up

Verify critical metrics are being collected.

# Under-replicated partitions (should be 0)
$ curl -s 'http://localhost:9090/api/v1/query?query=kafka_server_replicamanager_underreplicatedpartitions' | python3 -m json.tool | grep value

# Active controllers (should be 1)
$ curl -s 'http://localhost:9090/api/v1/query?query=sum(kafka_controller_kafkacontroller_activecontrollercount)' | python3 -m json.tool | grep value

# Consumer lag
$ curl -s 'http://localhost:9090/api/v1/query?query=sum(kafka_consumergroup_lag)+by+(consumergroup)' | python3 -m json.tool | grep value
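
Grepping for "value" works for a quick look, but the instant-query response is structured JSON that is easy to parse properly. The sketch below turns such a response into a dictionary keyed by label set; `instant_values` is a hypothetical helper and the sample payload is made up to match the shape of the query API's vector results.

```python
import json

def instant_values(response_text):
    """Map each series' labels to its float value from an instant-query response."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body.get('error', 'unknown error')}")
    results = {}
    for series in body["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        _timestamp, value = series["value"]  # the value arrives as a string
        results[labels] = float(value)
    return results

sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"consumergroup": "my-group"}, "value": [1710842400, "67"]},
    ]},
})
print(instant_values(sample))
# {(('consumergroup', 'my-group'),): 67.0}
```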

Confirm alert rules loaded correctly.

$ curl -s http://localhost:9090/api/v1/rules | python3 -c "
import json,sys
data = json.load(sys.stdin)
for g in data['data']['groups']:
    for r in g['rules']:
        print(f\"{r['name']:40s} {r['health']}\")"
KafkaOfflinePartitions                   ok
KafkaUnderReplicatedPartitions           ok
KafkaNoActiveController                  ok
KafkaBrokerDown                          ok
KafkaConsumerLagHigh                     ok
KafkaConsumerLagGrowing                  ok
KafkaIsrShrinkRate                       ok
KafkaRequestHandlerSaturated             ok

Monitoring Summary and Port Reference

The table below summarizes all components, ports, and their purpose for quick reference.

| Component | Port | Protocol | Purpose |
|---|---|---|---|
| JMX Exporter (Kafka) | 7071 | TCP | Broker JMX metrics in Prometheus format |
| JMX Exporter (ZooKeeper) | 7072 | TCP | ZooKeeper JMX metrics |
| kafka_exporter | 9308 | TCP | Topic, partition, consumer group metrics |
| Burrow | 8000 | TCP | Consumer lag evaluation HTTP API |
| Burrow Exporter | 9188 | TCP | Burrow metrics in Prometheus format |
| Prometheus | 9090 | TCP | Metrics storage, querying, and alerting |
| Grafana | 3000 | TCP | Dashboards and visualization |

Conclusion

You now have a full Kafka monitoring stack with JMX Exporter for broker and ZooKeeper internals, kafka_exporter for consumer group tracking, Prometheus for metrics collection and alerting, Grafana for dashboards, and Burrow for intelligent consumer lag evaluation. The alerting rules cover the most critical failure scenarios - offline partitions, broker outages, controller issues, and consumer lag growth.

For production hardening, enable TLS between exporters and Prometheus, set up Alertmanager with PagerDuty or Slack integrations for on-call routing, and consider running Prometheus with long-term storage using Thanos or Cortex if you need retention beyond 15 days. Review Apache Kafka best practices for additional production tuning.
