Configure Grafana Dashboards and Alerting on Kubernetes [Tested]

Deploying kube-prometheus-stack gives you Grafana with 20+ pre-built Kubernetes dashboards out of the box. That’s a solid starting point, but production monitoring demands more. You need custom dashboards tailored to your actual workloads, PromQL queries that surface the metrics your team cares about, and alert rules that wake up the right person at 3 AM when a pod starts crash-looping.

This guide walks through building custom Grafana dashboards with template variables and PromQL, configuring Grafana Unified Alerting with Slack and PagerDuty contact points, and setting up production-grade alert rules for Kubernetes. Everything here was tested on a live k3s cluster running kube-prometheus-stack 82.14.1 with real workloads across multiple namespaces. If you haven’t deployed the monitoring stack yet, start with our Prometheus and Grafana installation guide for Kubernetes.

Tested March 2026 | kube-prometheus-stack 82.14.1, Grafana 11.x, Prometheus 3.x, k3s v1.34.5

Explore the Built-in Dashboards

kube-prometheus-stack ships with over 20 dashboards provisioned automatically via ConfigMaps. These cover cluster resources, node metrics, Kubernetes internals, and the monitoring stack itself. Open Grafana (port 30080 in this setup), click Dashboards in the left sidebar, and you’ll see the full list organized by folder.

The most useful built-in dashboards for day-to-day operations:

Kubernetes / Compute Resources / Cluster: CPU and memory usage aggregated across all nodes and namespaces
Kubernetes / Compute Resources / Namespace (Pods): per-pod resource consumption filtered by namespace
Kubernetes / Compute Resources / Node (Pods): resource usage broken down by node with pod-level detail
Node Exporter Full: deep host-level metrics (disk I/O, network, filesystem, CPU per core)
CoreDNS: DNS query rates, latency, and cache hit ratios
etcd: leader elections, proposal rates, WAL sync duration
Kubelet: pod lifecycle operations, runtime latency, volume manager stats
API Server: request rates, latency percentiles, admission webhook duration

The cluster compute resources dashboard is the one most teams open first. It gives a bird’s-eye view of CPU and memory across every namespace.

The node-level dashboard breaks resource usage down per pod on each node, which is useful for spotting noisy neighbors or imbalanced scheduling.

These dashboards are read-only by default because they are provisioned from ConfigMaps. To customize one, use the dashboard’s Save As option to create an editable copy, then modify the copy.

Import Community Dashboards

The Grafana dashboard marketplace has thousands of community-built dashboards. Rather than building everything from scratch, import proven dashboards for common use cases and customize from there.

Dashboard ID	Name	Use Case
1860	Node Exporter Full	Detailed host-level metrics with CPU, memory, disk, network panels
315	Kubernetes Cluster	Cluster-wide overview with namespace and pod breakdowns
12708	Nginx Ingress Controller	Ingress traffic rates, latency percentiles, error rates
14981	K8s Pod Resources	Per-pod CPU and memory usage with resource limits comparison

To import a community dashboard, navigate to Dashboards > New > Import in the Grafana sidebar. Enter the dashboard ID (for example, 1860) and click Load. Select Prometheus as the data source and click Import. The dashboard appears immediately with live data from your cluster.

Imported dashboards are fully editable, unlike the provisioned ones. You can rearrange panels, change PromQL queries, and save changes directly.

Build a Custom Dashboard from Scratch

Built-in dashboards are great for general Kubernetes health, but they don’t know anything about your application topology. A custom dashboard lets you focus on the namespaces, pods, and metrics that matter to your team. Here’s how to build a production overview dashboard with template variables, stat panels, time series graphs, and a pod status table.

Create the Dashboard and Add a Namespace Variable

Click Dashboards > New > New dashboard in the Grafana sidebar. Before adding any panels, set up a template variable so every panel can be filtered by namespace. Click the gear icon (Dashboard Settings) at the top, then go to Variables > Add variable.

Configure the variable with these settings:

Name: namespace
Type: Query
Data source: Prometheus
Query: label_values(kube_pod_info, namespace)
Include All option: Enabled
Custom all value: .*

This creates a dropdown at the top of the dashboard populated with every namespace in the cluster. Selecting a namespace filters all panels that reference $namespace in their queries. The “All” option uses a regex wildcard to show data from every namespace combined.
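To see why the panel queries use the regex matcher =~ rather than =, here is a minimal Python sketch of the substitution. It is a simplified model, not Grafana's actual implementation: the interpolate function is hypothetical and ignores multi-value selections and escaping.

```python
def interpolate(query: str, selection: str, custom_all: str = ".*") -> str:
    """Replace $namespace with the dropdown selection.

    Simplified model: "All" becomes the custom all value, anything else is
    inserted unchanged. Real Grafana also handles multi-value selections
    and escaping, which are ignored here.
    """
    value = custom_all if selection == "All" else selection
    return query.replace("$namespace", value)


# Query shape used by the panels in this guide.
QUERY = 'sum(kube_pod_container_status_restarts_total{namespace=~"$namespace"})'

single = interpolate(QUERY, "production")
everything = interpolate(QUERY, "All")

print(single)      # selector becomes namespace=~"production"
print(everything)  # selector becomes namespace=~".*"
```

With an exact matcher (=), the "All" case would look for a namespace literally named .* and return nothing.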

Add Stat Panels for Key Metrics

Stat panels are single-value displays that sit across the top row and give an instant health snapshot. Add four of them by clicking Add > Visualization and selecting Stat as the visualization type for each.

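Running Pods count (use sum rather than count, because kube_pod_status_phase exposes a 0 or 1 series for every phase of every pod, and count would also include the pods whose Running series is 0):

sum(kube_pod_status_phase{namespace=~"$namespace",phase="Running"})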

CPU Usage (as a percentage of one CPU core, summed across all containers in the namespace; values above 100 mean more than one full core is in use):

sum(rate(container_cpu_usage_seconds_total{namespace=~"$namespace",container!=""}[5m])) * 100

Memory Usage (working set in MiB):

sum(container_memory_working_set_bytes{namespace=~"$namespace",container!=""}) / 1024 / 1024

Pod Restarts (total across namespace):

sum(kube_pod_container_status_restarts_total{namespace=~"$namespace"})

Set the unit to Percent (0-100) for CPU, mebibytes for memory, and short for the pod count and restart panels. Configure color thresholds on the restart panel (green at 0, red above 0) so restarts are immediately visible.
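The unit conversions behind the CPU and memory panels are easy to get wrong, so here is a small worked example in Python. The function names and sample numbers are made up for illustration; the arithmetic mirrors the "* 100" and "/ 1024 / 1024" parts of the queries above.

```python
def cpu_percent_of_one_core(cpu_seconds_start, cpu_seconds_end, window_seconds):
    """CPU-seconds consumed per second is the number of cores in use;
    multiplying by 100 expresses it as a percentage of a single core."""
    cores = (cpu_seconds_end - cpu_seconds_start) / window_seconds
    return cores * 100


def bytes_to_mib(num_bytes):
    """Same conversion as the / 1024 / 1024 in the memory query."""
    return num_bytes / 1024 / 1024


# 0.405 CPU-seconds used during a 5-minute (300 s) window: a lightly loaded namespace.
light = cpu_percent_of_one_core(0.0, 0.405, 300)

# 750 CPU-seconds used during the same window: 2.5 cores busy on average.
heavy = cpu_percent_of_one_core(0.0, 750.0, 300)

memory = bytes_to_mib(64 * 1024 * 1024)

print(round(light, 3), heavy, memory)
```

The second case comes out at 250, which is why the CPU panel can legitimately exceed 100 on a multi-core workload.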

Add Time Series Panels

Below the stat row, add two time series panels for CPU and memory trends over time. These show which pods are consuming the most resources and whether usage is trending upward.

CPU by Pod (percentage of one core per pod over time):

sum by(pod)(rate(container_cpu_usage_seconds_total{namespace=~"$namespace",container!=""}[5m])) * 100

Memory by Pod (MiB per pod over time):

sum by(pod)(container_memory_working_set_bytes{namespace=~"$namespace",container!=""}) / 1024 / 1024

Set the legend to {{pod}} on both panels so each line is labeled with the pod name. The time series visualization is the default in Grafana 11.x, so just paste the query and it renders correctly.

Add a Pod Status Table

A table panel showing current pod status across namespaces is useful for spotting pods stuck in Pending or Failed states. Use the Table visualization with this query:

kube_pod_status_phase{namespace=~"$namespace"} == 1

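Set the format to Table and the query type to Instant in the query options, so the table holds one row per series instead of one row per timestamp. Configure column transformations to show only the namespace, pod, and phase labels, hiding the value and other internal labels.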

The Finished Dashboard

With the production namespace selected, the dashboard shows 6 running pods (3 web-frontend, 2 api-backend, 1 cache-redis), 0.135% CPU usage, 32.7 MiB memory, and 0 restarts. These numbers are from a real k3s cluster with lightweight test workloads.

Switching the namespace dropdown to “All” aggregates metrics across production and staging, showing 8 total pods.

Save the dashboard with a descriptive name like “K8s Production Overview” and optionally tag it with kubernetes and custom for easier searching.

PromQL Quick Reference for Kubernetes

PromQL is where most teams get stuck. The official PromQL documentation covers the language in full, but here are the queries you’ll use most often when monitoring Kubernetes workloads.

Metric	PromQL	What It Shows
CPU per pod	`sum by(pod)(rate(container_cpu_usage_seconds_total{namespace="prod",container!=""}[5m]))`	CPU cores used per pod
Memory per namespace	`sum by(namespace)(container_memory_working_set_bytes{container!=""})`	Total working memory by namespace
Pod restart count	`kube_pod_container_status_restarts_total`	Cumulative restarts per container
Pods not ready	`kube_pod_status_ready{condition="false"} == 1`	Pods failing readiness probes
PVC usage %	`kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100`	Disk usage per PersistentVolumeClaim
Node CPU %	`100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)`	Node CPU utilization
Node memory %	`(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100`	Node memory utilization
Container OOMKilled	`kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}`	Containers killed by OOM
Deployment replica mismatch	`kube_deployment_spec_replicas != kube_deployment_status_available_replicas`	Deployments not at desired replica count
HTTP request rate	`sum(rate(http_requests_total[5m]))`	Requests per second (requires app-level exporter)

A few PromQL patterns worth internalizing: rate() must always wrap a counter metric and requires a range selector like [5m]. Use sum by(label) to aggregate across dimensions. The container!="" filter drops the pod-level aggregate series that cAdvisor exports with an empty container label; without it, the same usage is counted more than once and skews your numbers.
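If rate() still feels opaque, the following Python sketch shows the core idea on raw counter samples. It is deliberately simplified and is not how Prometheus implements it: the real function also extrapolates to the edges of the range window. The simple_rate name and the sample values are invented for the example.

```python
def simple_rate(samples):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplification: divides by the time between the first and last sample
    and does no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples in the window

    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            # The counter dropped, so it was reset (for example after a
            # container restart); count the new value as growth from zero.
            increase += cur

    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter resets between t=60 and t=120.
samples = [(0, 10.0), (60, 40.0), (120, 5.0), (180, 35.0)]
per_second = simple_rate(samples)
print(per_second)
```

This is also why rate() only makes sense on counters: on a gauge, every downward move would be misread as a reset.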

Configure Grafana Unified Alerting

Grafana Unified Alerting (the default alerting system since Grafana 9, and the only one since legacy alerting was removed in Grafana 11) lets you create alert rules, define notification routing, and manage silences entirely from the Grafana UI. It works with any Grafana data source, not just Prometheus, making it more flexible than Alertmanager for teams that consolidate on Grafana as their single monitoring interface.

Create a Contact Point (Slack)

Contact points define where alert notifications get delivered. Navigate to Alerting > Contact points in the Grafana sidebar, then click Add contact point.

Configure the Slack integration:

Name: slack-ops
Integration: Slack
Webhook URL: paste your Slack incoming webhook URL
Channel: #ops-alerts (optional override)

Click Test to send a test notification to the channel, then Save contact point.

To get a Slack webhook URL: go to api.slack.com, create a new app (or use an existing one), enable Incoming Webhooks, click Add New Webhook to Workspace, select the target channel, and copy the generated URL.

For PagerDuty, use the same flow but select PagerDuty as the integration type and paste the integration (routing) key from the PagerDuty service integration settings. For Email, select the Email integration and configure SMTP either in grafana.ini or via Helm values under the grafana key (the smtp section of grafana."grafana.ini" in kube-prometheus-stack).

Set Up Notification Policies

Notification policies control the routing logic: which alerts go to which contact point, how they’re grouped, and how often repeat notifications fire. Navigate to Alerting > Notification policies.

The default policy sends all alerts to whatever contact point is set as default. To route specific alerts differently, add nested policies that match on labels:

severity=critical routes to PagerDuty (pages on-call)
severity=warning routes to Slack (informational)
team=platform routes to #platform-alerts channel

Group alerts by namespace and alertname to avoid notification floods. Set the group wait to 30 seconds, group interval to 5 minutes, and repeat interval to 4 hours. These values work well for most teams because they batch related alerts together without delaying critical notifications too long.
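The label-based routing described above can be pictured with a short Python sketch. It models only the default behaviour, where nested policies are checked in order and the first match wins (Grafana's "continue matching" option is ignored). Apart from slack-ops, the contact point names are hypothetical.

```python
# Nested policies in the order they appear in the UI: (label matchers, contact point).
POLICIES = [
    ({"severity": "critical"}, "pagerduty-oncall"),   # hypothetical name
    ({"severity": "warning"}, "slack-ops"),
    ({"team": "platform"}, "slack-platform-alerts"),  # hypothetical name
]
DEFAULT_CONTACT_POINT = "default-email"               # hypothetical name


def route(labels):
    """Return the contact point of the first policy whose matchers all match."""
    for matchers, contact_point in POLICIES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return contact_point
    return DEFAULT_CONTACT_POINT


print(route({"severity": "critical", "team": "platform"}))
print(route({"team": "platform"}))
print(route({"severity": "info"}))
```

Note that a critical alert from the platform team goes to PagerDuty only, because the severity policy is listed first. Order the nested policies with that in mind.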

For planned maintenance, use Alerting > Silences to temporarily suppress notifications by matching label selectors.

Create Custom Alert Rules

Navigate to Alerting > Alert rules and click New alert rule. Each rule needs a PromQL query, a condition threshold, and a pending period (how long the condition must be true before firing). Here are three production-relevant rules tested on our k3s cluster.

Rule 1: High Pod Memory Usage (>80%)

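This fires when any container in the production namespace uses more than 80% of its memory limit for 5 consecutive minutes. Catching memory pressure early helps prevent OOMKills. The > 0 filter on the limit skips containers that have no memory limit set; they report a limit of 0, so the division would otherwise return +Inf and trip the alert.

(container_memory_working_set_bytes{namespace="production",container!=""} / (container_spec_memory_limit_bytes{namespace="production",container!=""} > 0)) * 100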

Condition: when last() of query is above 80
Pending period: 5 minutes
Labels: severity=warning, team=platform
Summary annotation: Pod {{ $labels.pod }} memory usage is above 80% of its limit
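As a quick sanity check of the arithmetic in this rule, here is a worked example in Python. The memory_percent_of_limit function and the byte values are illustrative only; the one cluster fact it relies on is that a container without a memory limit reports a limit of 0.

```python
def memory_percent_of_limit(working_set_bytes, limit_bytes):
    """Working set as a percentage of the memory limit.

    Returns None when no limit is set (reported as 0), because a
    percentage of "no limit" is meaningless.
    """
    if limit_bytes <= 0:
        return None
    return working_set_bytes / limit_bytes * 100


MIB = 1024 * 1024

# 420 MiB used against a 512 MiB limit.
usage = memory_percent_of_limit(420 * MIB, 512 * MIB)
print(usage, usage > 80)

# Same usage, but the container has no limit configured.
print(memory_percent_of_limit(420 * MIB, 0))
```

The first container would trip the 80% condition; the second has nothing to compare against and should not be evaluated by this rule.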

Rule 2: Pod Restarting Frequently

More than 2 restarts in 15 minutes usually indicates a crash loop. This catches flapping pods that Kubernetes keeps restarting without human intervention.

increase(kube_pod_container_status_restarts_total{namespace="production"}[15m])

Condition: when last() of query is above 2
Pending period: 5 minutes
Labels: severity=critical, team=platform
Summary annotation: Pod {{ $labels.pod }} has restarted {{ $value }} times in 15 minutes

Rule 3: Node CPU Above 80%

Sustained high CPU on a node means workloads may get throttled or new pods can’t be scheduled. The 10-minute pending period filters out short bursts that resolve on their own.

100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Condition: when last() of query is above 80
Pending period: 10 minutes
Labels: severity=critical, team=infra
Summary annotation: Node {{ $labels.instance }} CPU usage above 80% for 10 minutes
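The pending period used by all three rules can be pictured as a small state machine. The Python sketch below is a simplified model, not Grafana's scheduler: it assumes a 60-second evaluation interval (an assumption, not a value from this setup) and tracks only the Normal, Pending, and Alerting states.

```python
def evaluate(breaches, interval_s=60, pending_s=300):
    """Return the alert state after each evaluation.

    breaches: one boolean per evaluation, True when the condition is met.
    The rule is Pending while the condition holds, and switches to
    Alerting once it has held continuously for the pending period.
    """
    states = []
    breached_since = None
    for i, breaching in enumerate(breaches):
        now = i * interval_s
        if not breaching:
            breached_since = None  # any recovery restarts the clock
            states.append("Normal")
            continue
        if breached_since is None:
            breached_since = now
        held_for = now - breached_since
        states.append("Alerting" if held_for >= pending_s else "Pending")
    return states


# A short spike, a recovery, then a sustained breach.
timeline = [True, True, True, False] + [True] * 7
states = evaluate(timeline)
print(states)
```

The three-evaluation spike at the start never leaves Pending, which is the noise filtering the pending period is meant to provide.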

After creating all three rules, the alert rules page shows them alongside the 245 built-in rules from kube-prometheus-stack. The built-in rules cover Kubernetes internals, Prometheus health, and node_exporter metrics, so you only need to add rules specific to your application workloads.

Alertmanager vs Grafana Unified Alerting

kube-prometheus-stack deploys both Alertmanager and Grafana with Unified Alerting enabled. This creates some overlap, so understanding when to use each matters.

Feature	Grafana Unified Alerting	Alertmanager
Configuration method	Grafana UI or provisioning API	YAML (alertmanager.yaml), stored in a Secret or set through Helm values
Data sources	Any Grafana data source (Prometheus, Loki, PostgreSQL, etc.)	Prometheus only
Alert routing	Notification policies in the Grafana UI	Route tree in alertmanager.yaml
Silencing	Grafana Silences UI	Alertmanager UI or amtool CLI
Grouping	By labels via notification policies	By labels via route config
Best for	Teams using Grafana as the primary monitoring UI	GitOps and YAML-first teams managing config in Git

Most teams pick one and stick with it. If you manage everything through Grafana and prefer clicking through a UI, Grafana Unified Alerting is the simpler path. If your team follows GitOps practices and wants alert configuration versioned in a Git repository alongside Helm values, Alertmanager with PrometheusRule CRDs is the better fit. Running both simultaneously works but creates confusion about which system owns which alerts.

Dashboard Provisioning via ConfigMaps (GitOps)

Dashboards created through the Grafana UI are stored in its internal database, which means they’re lost if the Grafana pod is recreated without persistent storage. The production-safe approach is to store dashboard JSON in Kubernetes ConfigMaps. The Grafana sidecar container (enabled by default in kube-prometheus-stack) watches for ConfigMaps labeled grafana_dashboard: "1" and automatically loads them into Grafana.

Export your custom dashboard JSON from Grafana (Dashboard Settings > JSON Model > Copy), then wrap it in a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-production-overview
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  production-overview.json: |
    {
      "title": "K8s Production Overview",
      "uid": "e2a2945f-ce98-4eb1-b060-472f6f45f7ed",
      "panels": [ ... ],
      "templating": {
        "list": [
          {
            "name": "namespace",
            "type": "query",
            "query": "label_values(kube_pod_info, namespace)"
          }
        ]
      }
    }

Apply it to the cluster:

kubectl apply -f production-overview-configmap.yaml

The sidecar picks up the ConfigMap within a few seconds and the dashboard appears in Grafana. Because ConfigMaps are Kubernetes resources, you can version them in Git, deploy them with Helm or ArgoCD, and they persist across pod restarts without needing a separate PVC for Grafana.
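If you export dashboards regularly, wrapping the JSON by hand gets tedious. The Python sketch below builds the same ConfigMap structure as the manifest above and prints it as JSON, which kubectl accepts in place of YAML. The build_dashboard_configmap helper is hypothetical, and the inline dashboard is a trimmed stand-in for a real export.

```python
import json


def build_dashboard_configmap(dashboard, name, filename, namespace="monitoring"):
    """Wrap a dashboard JSON model in a ConfigMap the Grafana sidecar will load."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Label the sidecar watches for.
            "labels": {"grafana_dashboard": "1"},
        },
        # ConfigMap values must be strings, so the dashboard is serialized.
        "data": {filename: json.dumps(dashboard, indent=2)},
    }


# Stand-in for the JSON copied from Dashboard Settings > JSON Model.
dashboard = {
    "title": "K8s Production Overview",
    "panels": [],
    "templating": {
        "list": [
            {
                "name": "namespace",
                "type": "query",
                "query": "label_values(kube_pod_info, namespace)",
            }
        ]
    },
}

manifest = build_dashboard_configmap(
    dashboard, "custom-production-overview", "production-overview.json"
)
print(json.dumps(manifest, indent=2))
```

Saved as a script (the file name is up to you), the output can be piped straight into kubectl apply -f - or committed to Git.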

For larger teams, store all custom dashboards in a dedicated Git repository and use a CI pipeline to apply the ConfigMaps. This gives you version history, pull request reviews on dashboard changes, and easy rollback if someone breaks a panel.

Production Alert Rules to Start With

The three custom rules created earlier cover application-level concerns. Here’s a broader set of alert rules that every Kubernetes cluster should have from day one. These catch the most common failure modes before they impact users.

Alert	PromQL	Severity	Pending
Pod CrashLooping	`increase(kube_pod_container_status_restarts_total[1h]) > 5`	critical	15m
Node Not Ready	`kube_node_status_condition{condition="Ready",status="true"} == 0`	critical	5m
PVC >90% Full	`kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9`	warning	10m
Deployment Replica Mismatch	`kube_deployment_spec_replicas != kube_deployment_status_available_replicas`	warning	15m
High Memory (>85%)	`container_memory_working_set_bytes{container!=""} / (container_spec_memory_limit_bytes{container!=""} > 0) > 0.85`	warning	5m
Node Disk >85%	`(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) > 0.85`	warning	10m
API Server Error Rate >3%	`sum(rate(apiserver_request_total{code=~"5.."}[5m])) / sum(rate(apiserver_request_total[5m])) > 0.03`	critical	5m

Set the pending period long enough to avoid alert noise from transient spikes but short enough to catch real problems before they cascade. The values above are reasonable starting points. Tune them based on your cluster’s baseline behavior after a few weeks of observation.

You can create these as Grafana alert rules through the UI (as shown earlier) or as PrometheusRule CRDs if you prefer the Alertmanager path:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-k8s-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 15m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crash looping"
        - alert: NodeNotReady
          expr: kube_node_status_condition{condition="Ready",status="true"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.node }} is not ready"

The release: kube-prometheus-stack label is important. Without it, the Prometheus Operator won’t pick up the PrometheusRule resource. This label must match the Helm release name used during installation.
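The remaining rows of the table can be turned into a PrometheusRule in the same shape. The Python sketch below generates one as JSON from two of those rows; the build_prometheus_rule helper, the resource name, and the alert names are hypothetical, while the expressions, severities, and pending periods come from the table.

```python
import json


def build_prometheus_rule(name, rules, release="kube-prometheus-stack",
                          namespace="monitoring"):
    """Build a PrometheusRule manifest from (alert, expr, severity, pending) tuples."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Must match the Helm release name, as explained above.
            "labels": {"release": release},
        },
        "spec": {
            "groups": [
                {
                    "name": "custom.rules",
                    "rules": [
                        {
                            "alert": alert,
                            "expr": expr,
                            "for": pending,
                            "labels": {"severity": severity},
                        }
                        for alert, expr, severity, pending in rules
                    ],
                }
            ]
        },
    }


RULES = [
    ("PVCAlmostFull",
     "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.9",
     "warning", "10m"),
    ("DeploymentReplicaMismatch",
     "kube_deployment_spec_replicas != kube_deployment_status_available_replicas",
     "warning", "15m"),
]

rule_manifest = build_prometheus_rule("custom-k8s-alerts-extra", RULES)
print(json.dumps(rule_manifest, indent=2))
```

If your Helm release has a different name, pass it as the release argument so the label still matches.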

The Prometheus Targets page (port 30090 in our setup) confirms that all exporters are healthy and being scraped correctly. Check this page after deploying new ServiceMonitors or if alerts stop firing unexpectedly.

The Alertmanager UI (port 30093) shows currently firing and silenced alerts. Even if you primarily use Grafana for alert management, checking Alertmanager directly is useful for debugging routing issues.

What’s Next

With dashboards and alerting in place, the monitoring stack covers the core observability needs for most Kubernetes clusters. A few directions to explore from here:

Long-term metrics storage with Grafana Mimir or Thanos. Prometheus retains data for 15 days by default, which isn’t enough for capacity planning or trend analysis over months
Log-based alerting with Grafana Loki. Alert on application log patterns (error rates, specific exception strings) alongside metric-based alerts
Distributed tracing with Grafana Tempo for request-level visibility across microservices
SLO tracking using the Pyrra or Sloth projects, which generate recording and alerting rules from SLO definitions
Custom application metrics by instrumenting your code with Prometheus client libraries and creating ServiceMonitor resources to scrape them

For the full Grafana Alerting documentation, including advanced features like multi-dimensional alerting and recording rules, refer to the official docs.