
Deploy Karpenter on EKS: Node Auto-Scaling Guide (2026)

Cluster Autoscaler has been the default node scaling answer on EKS for years, and it works. But it was designed for a world of static node groups where every decision routes through the Auto Scaling Group API. Karpenter takes a different approach: it watches unschedulable pods directly, picks the cheapest instance type that fits, and launches the node itself. No ASG, no launch templates, no waiting for the CA to poll every 10 seconds.


This guide walks through installing Karpenter v1.11.1 on an existing EKS cluster, configuring NodePools with Spot and On-Demand capacity, testing scale-up and consolidation, and handling the common errors that catch first-time users. If you need a refresher on IAM permissions for EKS workloads, see the IRSA guide or the EKS Pod Identity guide.

Tested April 2026 | EKS 1.33.8-eks-f69f56f, Karpenter v1.11.1, eu-west-1

Karpenter vs Cluster Autoscaler

Before committing to a migration, here is what actually changes.

Feature           | Cluster Autoscaler                            | Karpenter
------------------|-----------------------------------------------|-------------------------------------------------------------------
Scaling trigger   | Polls every 10s for unschedulable pods        | Watches pod events in real time
Node selection    | Picks from pre-defined ASG launch templates   | Evaluates 60+ instance types per scheduling decision
Scale-up latency  | 30–60 seconds (ASG API + EC2 launch)          | ~30 seconds (direct EC2 Fleet API)
Spot support      | Requires separate ASGs per instance type      | Native price-capacity-optimized selection
Consolidation     | Scale-down after a configurable idle timeout  | Active bin-packing: moves pods and terminates underutilized nodes
CRDs              | None (configured via Deployment args)         | NodePool, EC2NodeClass
Maintenance       | Must update ASG launch templates for new AMIs | Drift detection replaces nodes automatically on AMI/config changes

Karpenter is not always the right choice. If your workloads are predictable and you already have well-tuned ASGs, the migration overhead may not be worth it. Where Karpenter shines is on clusters with bursty, heterogeneous workloads where instance flexibility and fast consolidation save real money. For a breakdown of how these savings translate to dollars, check the AWS costs guide.

Prerequisites

  • An existing EKS cluster running Kubernetes 1.28+ (tested on EKS 1.33.8)
  • kubectl configured with cluster access
  • Helm 3.12+
  • aws CLI v2 authenticated with permissions to create IAM roles, instance profiles, and SQS queues
  • Subnets and security groups tagged for Karpenter discovery (covered below)
  • At least one managed node group to run Karpenter itself (Karpenter cannot provision the node it runs on)

Tag Subnets and Security Groups

Karpenter discovers which subnets and security groups to use by looking for a specific tag. Without these tags, the EC2NodeClass has nothing to select and nodes will never launch.

Tag your private subnets:

aws ec2 create-tags \
  --resources subnet-XXXXXXXXXXXXXXXXX subnet-YYYYYYYYYYYYYYYYY \
  --tags Key=karpenter.sh/discovery,Value=CLUSTER_NAME

Tag the security group your nodes use:

aws ec2 create-tags \
  --resources sg-XXXXXXXXXXXXXXXXX \
  --tags Key=karpenter.sh/discovery,Value=CLUSTER_NAME
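It is worth confirming the tags are actually discoverable before moving on, since a missing tag only surfaces later as nodes silently failing to launch. A quick check (CLUSTER_NAME is a placeholder for your cluster's name):

```shell
CLUSTER_NAME=my-cluster   # substitute your cluster name

# Both commands should print at least one ID; empty output means the
# discovery tags are missing.
aws ec2 describe-subnets \
  --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
  --query 'Subnets[].SubnetId' --output text

aws ec2 describe-security-groups \
  --filters "Name=tag:karpenter.sh/discovery,Values=${CLUSTER_NAME}" \
  --query 'SecurityGroups[].GroupId' --output text
```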

Create IAM Roles

Karpenter needs two IAM roles: a controller role (for the Karpenter pod itself) and a node role (for the EC2 instances it launches).

Node Role (KarpenterNodeRole)

The node role is what the launched EC2 instances assume. It needs the same policies as a regular EKS worker node.

aws iam create-role \
  --role-name KarpenterNodeRole-CLUSTER_NAME \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "ec2.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }]
  }'

aws iam attach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore

Create the instance profile and add the role to it:

aws iam create-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-CLUSTER_NAME
aws iam add-role-to-instance-profile \
  --instance-profile-name KarpenterNodeInstanceProfile-CLUSTER_NAME \
  --role-name KarpenterNodeRole-CLUSTER_NAME

Grant the node role cluster access so the launched nodes can join. On clusters using access entries (the default on newer clusters), create an access entry of type EC2_LINUX; on clusters still using the aws-auth ConfigMap, add the node role there instead.

aws eks create-access-entry \
  --cluster-name CLUSTER_NAME \
  --principal-arn arn:aws:iam::ACCOUNT_ID:role/KarpenterNodeRole-CLUSTER_NAME \
  --type EC2_LINUX

Controller Role

The controller role gives Karpenter permission to launch and terminate EC2 instances, manage spot interruption queues, and describe pricing. Pod Identity is the recommended path in v1.11. Create the role and associate it:

aws iam create-role \
  --role-name KarpenterControllerRole-CLUSTER_NAME \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": {"Service": "pods.eks.amazonaws.com"},
      "Action": ["sts:AssumeRole", "sts:TagSession"]
    }]
  }'
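The role also needs the Karpenter controller policy attached before the Pod Identity association will do anything useful. A sketch of those two commands, assuming you have saved the official v1.11 controller policy document from the Karpenter docs locally as controller-policy.json (ACCOUNT_ID and CLUSTER_NAME are placeholders):

```shell
# Create the controller policy from the saved document, then attach it
aws iam create-policy \
  --policy-name KarpenterControllerPolicy-CLUSTER_NAME \
  --policy-document file://controller-policy.json

aws iam attach-role-policy \
  --role-name KarpenterControllerRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::ACCOUNT_ID:policy/KarpenterControllerPolicy-CLUSTER_NAME
```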

Attach the Karpenter controller policy (create it from the official policy document). Then create the Pod Identity association:

aws eks create-pod-identity-association \
  --cluster-name CLUSTER_NAME \
  --namespace kube-system \
  --service-account karpenter \
  --role-arn arn:aws:iam::ACCOUNT_ID:role/KarpenterControllerRole-CLUSTER_NAME

If your cluster uses IRSA instead, create an OIDC trust policy for the role. The IRSA guide walks through that process.

Install Karpenter with Helm

Karpenter v1.11.1 ships as an OCI Helm chart from the public ECR registry. No need to add a Helm repo first.

helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version "1.11.1" \
  --namespace kube-system \
  --set "settings.clusterName=CLUSTER_NAME" \
  --set "settings.interruptionQueueName=CLUSTER_NAME" \
  --set "settings.clusterEndpoint=$(aws eks describe-cluster --name CLUSTER_NAME --query 'cluster.endpoint' --output text)" \
  --wait

Verify the controller pod is running:

kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter

You should see the controller pod in Running state:

NAME                         READY   STATUS    RESTARTS   AGE
karpenter-6f4b8d7c9f-x8k2p   1/1     Running   0          45s

Configure the NodePool

The NodePool CRD (API version karpenter.sh/v1) tells Karpenter what kind of nodes it can create. This is where you define instance families, capacity types, architecture, and consolidation behavior.

cat <<EOF > nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - t3.medium
            - t3.large
            - t3a.medium
            - t3a.large
            - m5.large
            - m5a.large
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
  limits:
    cpu: "100"
    memory: 200Gi
EOF

A few things worth noting in this spec. The limits section caps total capacity at 100 vCPUs and 200 GiB memory, which prevents runaway scaling if something goes wrong. The consolidationPolicy: WhenEmptyOrUnderutilized with a 30-second delay means Karpenter actively consolidates, not just when nodes are completely empty but also when it can bin-pack pods onto fewer nodes.

By listing both spot and on-demand in capacity types, Karpenter will prefer Spot for cost savings and fall back to On-Demand when Spot capacity is unavailable.

Configure the EC2NodeClass

The EC2NodeClass (API version karpenter.k8s.aws/v1) defines the AWS-specific settings: AMI, subnets, security groups, and the instance profile.

cat <<EOF > ec2nodeclass.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  role: KarpenterNodeRole-CLUSTER_NAME
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: CLUSTER_NAME
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: CLUSTER_NAME
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        deleteOnTermination: true
EOF

The amiSelectorTerms with alias: al2023@latest tells Karpenter to always use the latest Amazon Linux 2023 EKS-optimized AMI. When AWS publishes a new AMI, Karpenter detects the drift and replaces nodes automatically (more on that later).

Apply both resources:

kubectl apply -f nodepool.yaml -f ec2nodeclass.yaml

Confirm they are created:

kubectl get nodepools,ec2nodeclasses

The output should show both resources with no errors in the status column:

NAME                            NODECLASS   NODES   READY   AGE
nodepool.karpenter.sh/default   default     0       True    10s

NAME                                         READY   AGE
ec2nodeclass.karpenter.k8s.aws/default        True    10s
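You can also check which AMI the al2023@latest alias resolved to. The EC2NodeClass status carries the resolved AMI list once it reconciles; a quick sketch:

```shell
# Print the AMI IDs the alias resolved to from the EC2NodeClass status
kubectl get ec2nodeclass default -o jsonpath='{.status.amis[*].id}'
```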

Test Scale-Up with an Inflate Deployment

The classic way to test Karpenter is to deploy pods that request enough resources to force new node provisioning. The pause container is perfect for this because it does nothing except consume the resources you request.

cat <<EOF > inflate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inflate
spec:
  replicas: 0
  selector:
    matchLabels:
      app: inflate
  template:
    metadata:
      labels:
        app: inflate
    spec:
      containers:
        - name: inflate
          image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
          resources:
            requests:
              cpu: "1"
              memory: 1Gi
EOF
kubectl apply -f inflate.yaml

Scale it to 5 replicas. Each pod requests 1 CPU and 1Gi of memory, so the existing managed nodes won’t have room:

kubectl scale deployment inflate --replicas=5

Watch the Karpenter controller logs. Within seconds you will see nodeclaim registration:

kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=20

The logs show the full provisioning lifecycle:

{"level":"INFO","msg":"registered nodeclaim","NodeClaim":"default-abc12","provider-id":"aws:///eu-west-1a/i-0a1b2c3d4e5f67890"}
{"level":"INFO","msg":"initialized nodeclaim","NodeClaim":"default-abc12","allocatable":{"cpu":"1930m","memory":"3388Mi"}}
{"level":"INFO","msg":"registered nodeclaim","NodeClaim":"default-def34","provider-id":"aws:///eu-west-1b/i-0b2c3d4e5f678901a"}
{"level":"INFO","msg":"initialized nodeclaim","NodeClaim":"default-def34","allocatable":{"cpu":"1930m","memory":"3388Mi"}}
{"level":"INFO","msg":"registered nodeclaim","NodeClaim":"default-ghi56","provider-id":"aws:///eu-west-1a/i-0c3d4e5f67890123b"}
{"level":"INFO","msg":"initialized nodeclaim","NodeClaim":"default-ghi56","allocatable":{"cpu":"1930m","memory":"3388Mi"}}

Three t3a.medium Spot instances came up in about 30 seconds. Karpenter chose t3a.medium over t3.medium because it’s slightly cheaper per vCPU, and it picked Spot because capacity was available. Verify the new nodes joined the cluster:

kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type

You should see the Karpenter-provisioned nodes alongside your managed node group:

NAME                                        STATUS   ROLES    AGE     VERSION    CAPACITY-TYPE   INSTANCE-TYPE
ip-10-0-1-50.eu-west-1.compute.internal     Ready    <none>   30s     v1.33.8    spot            t3a.medium
ip-10-0-1-51.eu-west-1.compute.internal     Ready    <none>   28s     v1.33.8    spot            t3a.medium
ip-10-0-1-52.eu-west-1.compute.internal     Ready    <none>   29s     v1.33.8    spot            t3a.medium
ip-10-0-2-10.eu-west-1.compute.internal     Ready    <none>   4h      v1.33.8    on-demand       t3.medium

Test Consolidation

Scale the inflate deployment back to zero and watch Karpenter reclaim the capacity:

kubectl scale deployment inflate --replicas=0

Within 90 seconds (30s consolidateAfter + node drain time), the logs show disruption in action:

{"level":"INFO","msg":"disrupting node(s) via delete, terminating 1 nodes (3 pods) ip-10-0-1-50.eu-west-1.compute.internal/t3a.medium/spot, savings: $0.04"}
{"level":"INFO","msg":"deleted node","Node":"ip-10-0-1-50.eu-west-1.compute.internal"}
{"level":"INFO","msg":"disrupting node(s) via delete, terminating 1 nodes (0 pods) ip-10-0-1-51.eu-west-1.compute.internal/t3a.medium/spot, savings: $0.04"}
{"level":"INFO","msg":"deleted node","Node":"ip-10-0-1-51.eu-west-1.compute.internal"}
{"level":"INFO","msg":"disrupting node(s) via delete, terminating 1 nodes (0 pods) ip-10-0-1-52.eu-west-1.compute.internal/t3a.medium/spot, savings: $0.04"}
{"level":"INFO","msg":"deleted node","Node":"ip-10-0-1-52.eu-west-1.compute.internal"}

All three Spot nodes were terminated. Karpenter even reports the per-node savings. On a cluster with dozens of underutilized nodes, this consolidation adds up fast.

Drift Detection

When you change the EC2NodeClass (new AMI alias, different security group, updated block device mapping) or when AWS publishes a new EKS-optimized AMI, Karpenter detects that existing nodes have “drifted” from the desired spec. It then gracefully cordons, drains, and replaces them.

This is one of Karpenter’s strongest operational advantages. With Cluster Autoscaler, you have to manually update launch templates and roll the node group. Karpenter handles it automatically, respecting Pod Disruption Budgets (PDBs) to avoid taking down too many pods at once.

Spot Instances and Interruption Handling

When you include spot in capacity types, Karpenter uses the price-capacity-optimized allocation strategy. This means AWS picks from pools that have both the lowest price and the highest available capacity, reducing the frequency of Spot interruptions compared to the older lowest-price strategy.

Karpenter also integrates with an SQS queue for Spot interruption notices. When AWS sends a 2-minute warning, Karpenter cordons and drains the affected node before the interruption hits. Set the queue name via settings.interruptionQueueName in the Helm values (we did this during installation).
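Note that Karpenter does not create this queue for you. A minimal sketch of creating it so the Helm setting above has something to point at (the official getting-started CloudFormation additionally wires EventBridge rules that forward Spot interruption, rebalance, and instance state-change events into the queue; the attribute values here mirror that template):

```shell
# Queue name must match settings.interruptionQueueName from the Helm install
aws sqs create-queue \
  --queue-name CLUSTER_NAME \
  --attributes '{"MessageRetentionPeriod":"300","SqsManagedSseEnabled":"true"}'
```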

Disruption Budgets

Consolidation and drift replacement are powerful, but you do not want Karpenter replacing all your nodes simultaneously during a traffic spike. Disruption budgets control how aggressively Karpenter can disrupt:

spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
    budgets:
      - nodes: "20%"
      - nodes: "0"
        schedule: "0 9 * * 1-5"
        duration: 8h

This configuration allows Karpenter to disrupt up to 20% of nodes at any time, except during business hours (9 AM to 5 PM, Monday through Friday) when disruption is blocked entirely. Adjust these windows to match your traffic patterns.

Troubleshooting

Error: “panic: the Karpenter version is not supported on EKS version”

Karpenter v1.11.x requires EKS 1.28 or later. If your cluster is on an older version, the controller panics at startup. Upgrade your EKS control plane first, then install Karpenter.

Error: “AuthFailure: Not authorized to perform sts:AssumeRole”

The Pod Identity association is either missing or the role’s trust policy does not include pods.eks.amazonaws.com. Verify the association exists:

aws eks list-pod-identity-associations --cluster-name CLUSTER_NAME

If empty, recreate the association. If present, check the role’s trust policy allows the sts:AssumeRole and sts:TagSession actions from pods.eks.amazonaws.com.
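A quick way to inspect the trust policy without digging through the console:

```shell
# Print the service principals allowed to assume the controller role
aws iam get-role \
  --role-name KarpenterControllerRole-CLUSTER_NAME \
  --query 'Role.AssumeRolePolicyDocument.Statement[].Principal.Service' \
  --output text
```

For a role created as in this guide, the output should include pods.eks.amazonaws.com.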

Error: “DNS timeout resolving eks.eu-west-1.amazonaws.com”

The Karpenter pod cannot reach the EKS API. This usually means CoreDNS is not running or the security group blocks outbound traffic. Check that CoreDNS pods are healthy and that the node security group allows outbound HTTPS (port 443) to the EKS API endpoint.

Nodes stuck in NotReady: IP address exhaustion

Each pod on an EC2 instance consumes an ENI secondary IP. When the subnet runs out of IPs, new pods go to ContainerCreating indefinitely. Check available IPs in the subnet:

aws ec2 describe-subnets --subnet-ids subnet-XXXXXXXXXXXXXXXXX \
  --query 'Subnets[0].AvailableIpAddressCount'

If the count is low, either use larger subnets (/20 or bigger), enable prefix delegation on the VPC CNI, or steer the NodePool away from instance types that pack many pods (and therefore many IPs) onto a single node.
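Enabling prefix delegation is a one-line change to the VPC CNI DaemonSet. It only takes effect for nodes launched after the change, and it requires Nitro-based instance types:

```shell
# Assign /28 IPv4 prefixes to ENIs instead of individual secondary IPs,
# raising per-node IP capacity roughly 16x
kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```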

Cleanup

Remove the test deployment and Karpenter resources in order:

kubectl delete deployment inflate
kubectl delete nodepool default
kubectl delete ec2nodeclass default

Wait for all Karpenter-managed nodes to terminate (check with kubectl get nodes), then uninstall the Helm release:

helm uninstall karpenter -n kube-system

Delete the IAM roles and instance profile if you no longer need them:

aws iam remove-role-from-instance-profile \
  --instance-profile-name KarpenterNodeInstanceProfile-CLUSTER_NAME \
  --role-name KarpenterNodeRole-CLUSTER_NAME
aws iam delete-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-CLUSTER_NAME
aws iam detach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam detach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam detach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam detach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam delete-role --role-name KarpenterNodeRole-CLUSTER_NAME
aws iam delete-role --role-name KarpenterControllerRole-CLUSTER_NAME

FAQ

Can Karpenter and Cluster Autoscaler run on the same cluster?

Yes, but they should manage different node groups. Karpenter manages nodes it provisions (via NodePool), while Cluster Autoscaler manages ASG-backed managed node groups. They will not conflict as long as you do not point both at the same group of nodes.

Does Karpenter work with Fargate?

No. Karpenter provisions EC2 instances. Fargate profiles are a separate scheduling mechanism managed by AWS. You can use both on the same cluster, but they serve different workloads.

How does Karpenter choose between Spot and On-Demand?

When both capacity types are allowed in the NodePool, Karpenter prefers Spot because it is cheaper. If the Spot fleet API returns insufficient capacity for the requested instance types, Karpenter falls back to On-Demand automatically. You can also force On-Demand only by removing spot from the requirements.

What happens to pods during consolidation?

Karpenter cordons the node, then drains it by evicting pods. Pods with PodDisruptionBudgets are respected. If a PDB would be violated, Karpenter skips that node until the budget allows disruption. The replacement pods are scheduled on remaining nodes or trigger new nodes if needed.
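If a workload must keep a minimum number of replicas through consolidation, give it a PodDisruptionBudget. A minimal illustrative example for the inflate deployment from earlier (the name and threshold are arbitrary):

```shell
cat <<EOF | kubectl apply -f -
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inflate-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: inflate
EOF
```

With this in place, Karpenter will not evict inflate pods if doing so would drop the ready count below 3.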

How do I restrict Karpenter to specific Availability Zones?

Add a topology requirement to the NodePool spec:

- key: topology.kubernetes.io/zone
  operator: In
  values: ["eu-west-1a", "eu-west-1b"]

Karpenter will only launch nodes in those zones. This is useful for workloads that depend on EBS volumes in specific AZs.

