Deploy Prometheus and Grafana on EKS

Q: Can I use Amazon Managed Prometheus instead?

Yes. AMP is a managed Prometheus-compatible service. Configure remoteWrite in the Prometheus spec to ship metrics to AMP, then point Grafana at AMP as a data source.

Q: How do I expose Grafana externally with HTTPS?

Create an Ingress resource with alb.ingress.kubernetes.io annotations and an ACM certificate ARN. Add OIDC authentication annotations to restrict access to your team.

Q: What is the storage cost for 15 days of metrics on a small cluster?

A 2-node cluster generates about 2-3 GiB of TSDB data over 15 days. With gp3 at $0.08/GiB-month, that is under $1/month for the Prometheus volume. Grafana and AlertManager add about $0.80/month.

Running an EKS cluster without monitoring is like driving without a dashboard. You know something is wrong only when the engine catches fire. The kube-prometheus-stack Helm chart bundles Prometheus, Grafana, Alertmanager, and all the exporters you need into a single install that takes about five minutes to deploy and gives you 34 dashboards on day one.

Original content from computingforgeeks.com - post 165827

This guide covers installing the full stack on EKS, tuning it for the EKS-managed control plane (where etcd and scheduler metrics are not exposed), setting up persistent storage with the EBS CSI driver, walking through the most useful dashboards with real screenshots, and configuring AlertManager for Slack notifications. If you are running Karpenter for node auto-scaling, the node-level dashboards become especially valuable for tracking Spot instance churn and consolidation patterns.

Current as of April 2026. Verified on EKS 1.33.8-eks-f69f56f with kube-prometheus-stack Helm chart, eu-west-1

What kube-prometheus-stack Installs

The chart is a meta-package. One helm install deploys all of these:

Component	Workload Type	Purpose
Prometheus Operator	Deployment	Manages Prometheus and AlertManager CRDs
Prometheus	StatefulSet	Scrapes and stores metrics (TSDB)
Alertmanager	StatefulSet	Routes and deduplicates alerts
Grafana	Deployment	Visualization and dashboards
kube-state-metrics	Deployment	Exposes Kubernetes object state as metrics
prometheus-node-exporter	DaemonSet	Host-level CPU, memory, disk, network metrics
Prometheus Adapter (optional)	Deployment	Custom metrics for HPA

On our 2-node test cluster, this came out to 7 running pods: 5 singletons plus one node-exporter pod per node.

Prerequisites

EKS cluster running Kubernetes 1.27+ (tested on 1.33.8)
EBS CSI driver installed with an IRSA role. Prometheus uses a PersistentVolumeClaim for its TSDB, and the default gp2 StorageClass on EKS requires the EBS CSI driver. Without it, PVCs stay in Pending state forever. See the IRSA guide for setting up the driver role
A gp3 StorageClass (recommended over gp2 for better IOPS baseline at no extra cost)
Helm 3.12+
kubectl configured with cluster access

Create the gp3 StorageClass if it does not exist:

cat < gp3-sc.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  fsType: ext4
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF
kubectl apply -f gp3-sc.yaml

Install kube-prometheus-stack

Add the Helm repository and create a values file. EKS manages the control plane, which means etcd, kube-controller-manager, and kube-scheduler metrics are not accessible. Leaving those scrapers enabled generates noisy error logs and false alerts.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

Create the EKS-tuned values file:

cat < prom-values.yaml
# Disable control plane components (EKS manages these)
kubeEtcd:
  enabled: false
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false

# Prometheus
prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    retention: 15d
    resources:
      requests:
        cpu: 500m
        memory: 1Gi
      limits:
        memory: 2Gi

# Grafana
grafana:
  adminPassword: "CfgLabGrafana2026!"
  persistence:
    enabled: true
    storageClassName: gp3
    size: 5Gi

# Alertmanager
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 5Gi
EOF

Install the chart into a dedicated namespace:

helm upgrade --install kube-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  -f prom-values.yaml \
  --wait

Verify all pods are running:

kubectl get pods -n monitoring

All 7 pods should show 1/1 or 2/2 Ready:

NAME                                                        READY   STATUS    RESTARTS   AGE
alertmanager-kube-prom-stack-alertmanager-0                  2/2     Running   0          2m
kube-prom-stack-grafana-6d8f9c7b4d-k2x9p                    3/3     Running   0          2m
kube-prom-stack-kube-state-metrics-5f8b7d6c4-m7n3q           1/1     Running   0          2m
kube-prom-stack-prometheus-node-exporter-abc12                1/1     Running   0          2m
kube-prom-stack-prometheus-node-exporter-def34                1/1     Running   0          2m
kube-prom-stack-operator-7f9b8c6d5-p4r2t                     1/1     Running   0          2m
prometheus-kube-prom-stack-prometheus-0                       2/2     Running   0          2m

Confirm the PVCs are Bound (this proves the EBS CSI driver is working):

kubectl get pvc -n monitoring

Each PVC should show Bound status with the gp3 StorageClass:

NAME                                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
prometheus-kube-prom-stack-prometheus-db-prometheus-0         Bound    pvc-a1b2c3d4-e5f6-7890-abcd-ef1234567890   10Gi       RWO            gp3            2m
alertmanager-kube-prom-stack-alertmanager-db-alertmanager-0   Bound    pvc-b2c3d4e5-f6a7-8901-bcde-f12345678901   5Gi        RWO            gp3            2m
kube-prom-stack-grafana                                       Bound    pvc-c3d4e5f6-a7b8-9012-cdef-012345678901   5Gi        RWO            gp3            2m

Access Grafana

Port-forward to the Grafana service. For production clusters, expose it through an ALB Ingress instead (the AWS Load Balancer Controller guide covers that setup).

kubectl port-forward -n monitoring svc/kube-prom-stack-grafana 3000:80

Open http://localhost:3000 in your browser. Log in with username admin and the password from the values file.

Grafana login page. Use the admin credentials set in the Helm values file.

After logging in, the home screen shows a search bar and recently viewed dashboards. The kube-prometheus-stack ships with 34 pre-built dashboards covering every layer of the cluster.

Grafana home screen with 34 pre-configured dashboards ready to explore.

Dashboard Tour

Navigate to Dashboards and browse the folder labeled “Default”. Here are the dashboards that matter most on EKS.

Cluster Resources

The “Kubernetes / Compute Resources / Cluster” dashboard gives you the high-level picture. On our 2-node test cluster, it showed 3.42% CPU usage, 27.0% memory usage, and 54.7% memory limits commitment. These numbers tell you both current utilization and how much headroom your resource requests leave.

Cluster-level resource dashboard showing CPU and memory utilization across all namespaces. The 54.7% memory limits commitment means pods have requested just over half the cluster’s total memory.

Node Resources

Drill into individual nodes to see per-pod resource consumption. This is where you spot noisy neighbors, pods that request 500m CPU but never use more than 10m.

Node-level breakdown showing which pods consume the most resources. Use this to right-size resource requests.

Pod Resources

The pod-level dashboard shows container-level resource usage, requests, and limits over time. Essential for tuning VPA recommendations or catching memory leaks.

Container-level resource view showing actual usage against configured requests and limits over time.

Namespace Resources

Compare resource consumption across namespaces. On a multi-tenant cluster, this dashboard answers “which team is using the most capacity” without querying the API server.

Namespace-level resource comparison. Useful for chargeback and capacity planning on shared clusters.

Node Exporter

The node-exporter dashboard shows host-level metrics: CPU by mode (user, system, iowait), memory breakdown (used, cached, buffers), disk throughput, and network I/O. Our test node showed 27.4% memory usage.

Node exporter gives host-level visibility that Kubernetes metrics alone cannot provide, including disk IOPS and network throughput.

CoreDNS

DNS latency is a silent cluster killer. The CoreDNS dashboard shows request rate, cache hit ratio, and per-upstream latency. On EKS, high DNS latency often points to the VPC resolver throttle (1024 packets/second per ENI).

CoreDNS performance dashboard. Watch the cache hit ratio and upstream latency to catch DNS issues before they cascade into application timeouts.

Persistent Volumes

Track PV usage, capacity, and inode consumption. This dashboard alerts you before an EBS volume fills up and causes a Prometheus TSDB crash.

PV usage for the monitoring namespace. Set alerts at 80% capacity to avoid data loss from full volumes.

AlertManager

The AlertManager dashboard shows active alerts, notification success/failure rates, and silenced alerts. Useful for verifying that Slack notifications are actually being delivered.

AlertManager overview showing alert delivery status. Check the notification failures panel if Slack alerts stop arriving.

Prometheus Self-Monitoring

Prometheus monitors itself. This dashboard shows scrape duration, sample ingestion rate, TSDB block count, WAL size, and head chunks. When Prometheus gets slow, check here first.

Prometheus TSDB health. High head series count or increasing WAL size means it is time to reduce cardinality or add more memory.

Grafana Overview

Finally, the Grafana overview shows dashboard render times, active sessions, and data source query latency. Not critical for day-to-day ops, but helpful when debugging slow-loading dashboards.

Grafana self-monitoring. If dashboard load times exceed 5 seconds, check the data source query latency panel.

Configure AlertManager for Slack

Alerts are only useful if someone sees them. Add Slack webhook configuration to your Helm values and upgrade the release.

Add this block under the alertmanager section in prom-values.yaml:

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: 'slack-notifications'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
            channel: '#alerts-eks'
            send_resolved: true
            title: '{{ .GroupLabels.alertname }}'
            text: >-
              {{ range .Alerts }}
              *Alert:* {{ .Labels.alertname }}
              *Namespace:* {{ .Labels.namespace }}
              *Severity:* {{ .Labels.severity }}
              *Description:* {{ .Annotations.description }}
              {{ end }}

Apply the updated values:

helm upgrade kube-prom-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f prom-values.yaml \
  --wait

Verify by triggering a test alert. Kill a node-exporter pod and watch for the Slack message within 30 seconds.

Persistent Storage Considerations

The Prometheus TSDB stores all scraped metrics on the PVC. With a 10Gi volume and 15-day retention, a 2-node cluster uses about 2-3 GiB. For larger clusters (50+ nodes with many custom metrics), plan for 50-100Gi and consider increasing the retention period only if you have the storage budget.

EBS volumes are AZ-bound. If the Prometheus pod moves to a node in a different AZ, the PVC cannot follow. Use a nodeAffinity or topologySpreadConstraints to pin Prometheus to nodes in the same AZ as its volume. Alternatively, use EBS multi-attach (io2 Block Express) or switch to EFS for cross-AZ access, though EFS latency is higher.

Production Recommendations

For clusters with more than 20 nodes, consider these adjustments:

Increase Prometheus memory: set requests to 2Gi and limits to 4Gi. Each active time series consumes about 1-2 KB of memory in the head block
Use remote write: ship metrics to a managed service (Amazon Managed Prometheus, Grafana Cloud, Thanos) for long-term retention beyond what a single PVC can handle
Scrape interval tuning: the default 30-second scrape interval is fine for most use cases. Reducing it to 15s doubles storage and memory consumption
Drop high-cardinality metrics: use metricRelabelings to drop metrics you never query. The apiserver_request_duration_seconds_bucket metric alone generates thousands of time series
Separate Grafana ingress: expose Grafana through an ALB with OIDC authentication instead of port-forwarding. The AWS Load Balancer Controller guide shows how to set up ALB Ingress with SSL termination

For cost visibility into the monitoring stack itself, Kubecost can show exactly how much Prometheus and Grafana cost per month in compute, storage, and network.

Troubleshooting

PVC stuck in Pending state

This is the most common issue on fresh EKS clusters. It means the EBS CSI driver is either not installed or its IRSA role does not have the ec2:CreateVolume permission. Check the CSI driver pods:

kubectl get pods -n kube-system -l app=ebs-csi-controller

If no pods are listed, install the EBS CSI driver addon. If the pods exist but the PVC is still Pending, describe the PVC and check events:

kubectl describe pvc -n monitoring prometheus-kube-prom-stack-prometheus-db-prometheus-0

Look for “could not create volume in EC2” or “Access Denied” messages pointing to the IRSA role.

Node-exporter pods stuck in Pending with taints

The node-exporter DaemonSet needs to run on every node, including those with taints (like Karpenter-managed Spot nodes). Add tolerations to the DaemonSet via Helm values:

prometheus-node-exporter:
  tolerations:
    - operator: Exists

This tells the DaemonSet to tolerate all taints, ensuring coverage on every node.

Missing control plane metrics (etcd, scheduler, controller-manager)

EKS is a managed control plane. AWS does not expose etcd, kube-scheduler, or kube-controller-manager endpoints to customers. The dashboards for these components will show “No data”. This is expected behavior, not a bug. Disable the corresponding scrape jobs in your Helm values (which we already did) to stop the noisy scrape-failure alerts.

Prometheus OOMKilled

If Prometheus restarts with OOMKilled, the memory limit is too low for the number of active time series. Check the current series count:

kubectl exec -n monitoring prometheus-kube-prom-stack-prometheus-0 -c prometheus -- \
  promtool tsdb analyze /prometheus

Increase the memory limit in your Helm values, or reduce cardinality by dropping expensive metrics with metricRelabelings.

Cleanup

Uninstall the Helm release and delete the namespace. The PVCs are not deleted automatically, so remove them explicitly:

helm uninstall kube-prom-stack -n monitoring
kubectl delete pvc -n monitoring --all
kubectl delete namespace monitoring

The EBS volumes backing the PVCs will be deleted by the CSI driver since the reclaim policy is Delete.

FAQ

How much memory does Prometheus need on EKS?

As a baseline, 1Gi handles a 5-node cluster with default scrape targets. Each additional node adds roughly 500-1000 active time series from the node-exporter. For a 50-node cluster, allocate at least 4Gi. The prometheus_tsdb_head_series metric tells you the exact count.

Can I use Amazon Managed Prometheus instead?

Yes. AMP is a managed Prometheus-compatible service that handles storage and scaling. You configure remoteWrite in the Prometheus spec to ship metrics to AMP, then point Grafana at AMP as a data source. You still run Grafana yourself (or use Amazon Managed Grafana). The kube-prometheus-stack chart supports this via the prometheus.prometheusSpec.remoteWrite array.

Why are etcd dashboards empty on EKS?

EKS is a managed control plane. AWS runs etcd, kube-scheduler, and kube-controller-manager in their own infrastructure and does not expose the metrics endpoints. Disable these scrapers in the Helm values to avoid false alerts.

How do I expose Grafana externally with HTTPS?

Create an Ingress resource with the alb.ingress.kubernetes.io annotations and an ACM certificate ARN. The AWS Load Balancer Controller guide covers this pattern. Add OIDC authentication annotations to restrict access to your team.

What is the storage cost for 15 days of metrics on a small cluster?

A 2-node cluster with default scrape targets generates about 2-3 GiB of TSDB data over 15 days. With gp3 at $0.08/GiB-month, that is under $1/month for the Prometheus volume alone. Grafana and AlertManager volumes add another $0.80/month. For detailed cost breakdowns, the AWS costs guide covers EBS pricing tiers.