Cluster Autoscaler has been the default node scaling answer on EKS for years, and it works. But it was designed for a world of static node groups where every decision routes through the Auto Scaling Group API. Karpenter takes a different approach: it watches unschedulable pods directly, picks the cheapest instance type that fits, and launches the node itself. No ASG, no launch templates, no waiting for the CA to poll every 10 seconds.
This guide walks through installing Karpenter v1.11.1 on an existing EKS cluster, configuring NodePools with Spot and On-Demand capacity, testing scale-up and consolidation, and handling the common errors that catch first-time users. If you need a refresher on IAM permissions for EKS workloads, see the IRSA guide or the EKS Pod Identity guide.
Tested April 2026 | EKS 1.33.8-eks-f69f56f, Karpenter v1.11.1, eu-west-1
Karpenter vs Cluster Autoscaler
Before committing to a migration, here is what actually changes.
| Feature | Cluster Autoscaler | Karpenter |
|---|---|---|
| Scaling trigger | Polls every 10s for unschedulable pods | Watches pod events in real time |
| Node selection | Picks from pre-defined ASG launch templates | Evaluates 60+ instance types per scheduling decision |
| Scale-up latency | 30–60 seconds (ASG API + EC2 launch) | ~30 seconds (direct EC2 fleet API) |
| Spot support | Requires separate ASGs per instance type | Native price-capacity-optimized selection |
| Consolidation | Scale-down after configurable idle timeout | Active bin-packing: moves pods and terminates underutilized nodes |
| CRDs | None (configured via Deployment args) | NodePool, EC2NodeClass |
| Maintenance | Must update ASG launch templates for new AMIs | Drift detection replaces nodes on AMI/config changes automatically |
Karpenter is not always the right choice. If your workloads are predictable and you already have well-tuned ASGs, the migration overhead may not be worth it. Where Karpenter shines is on clusters with bursty, heterogeneous workloads where instance flexibility and fast consolidation save real money. For a breakdown of how these savings translate to dollars, check the AWS costs guide.
Prerequisites
- An existing EKS cluster running Kubernetes 1.28+ (tested on EKS 1.33.8)
- kubectl configured with cluster access
- Helm 3.12+
- aws CLI v2 authenticated with permissions to create IAM roles, instance profiles, and SQS queues
- Subnets and security groups tagged for Karpenter discovery (covered below)
- At least one managed node group to run Karpenter itself (Karpenter cannot provision the node it runs on)
Tag Subnets and Security Groups
Karpenter discovers which subnets and security groups to use by looking for a specific tag. Without these tags, the EC2NodeClass has nothing to select and nodes will never launch.
Tag your private subnets:
aws ec2 create-tags \
--resources subnet-XXXXXXXXXXXXXXXXX subnet-YYYYYYYYYYYYYYYYY \
--tags Key=karpenter.sh/discovery,Value=CLUSTER_NAME
Tag the security group your nodes use:
aws ec2 create-tags \
--resources sg-XXXXXXXXXXXXXXXXX \
--tags Key=karpenter.sh/discovery,Value=CLUSTER_NAME
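Before moving on, it is worth confirming the tags are actually discoverable. A quick check, using the same CLUSTER_NAME placeholder:

aws ec2 describe-subnets \
  --filters "Name=tag:karpenter.sh/discovery,Values=CLUSTER_NAME" \
  --query 'Subnets[].SubnetId'

aws ec2 describe-security-groups \
  --filters "Name=tag:karpenter.sh/discovery,Values=CLUSTER_NAME" \
  --query 'SecurityGroups[].GroupId'

Both commands should return the IDs you just tagged; an empty list means Karpenter will find nothing to launch into.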
Create IAM Roles
Karpenter needs two IAM roles: a controller role (for the Karpenter pod itself) and a node role (for the EC2 instances it launches).
Node Role (KarpenterNodeRole)
The node role is what the launched EC2 instances assume. It needs the same policies as a regular EKS worker node.
aws iam create-role \
--role-name KarpenterNodeRole-CLUSTER_NAME \
--assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "ec2.amazonaws.com"},
"Action": "sts:AssumeRole"
}]
}'
aws iam attach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam attach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam attach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam attach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Create the instance profile and add the role to it:
aws iam create-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-CLUSTER_NAME
aws iam add-role-to-instance-profile \
--instance-profile-name KarpenterNodeInstanceProfile-CLUSTER_NAME \
--role-name KarpenterNodeRole-CLUSTER_NAME
Grant the node role cluster access so the new nodes can join. On clusters using access entries (the default authentication mode for new EKS clusters), create an entry of type EC2_LINUX:
aws eks create-access-entry \
--cluster-name CLUSTER_NAME \
--principal-arn arn:aws:iam::ACCOUNT_ID:role/KarpenterNodeRole-CLUSTER_NAME \
--type EC2_LINUX
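On older clusters that still authenticate nodes through the aws-auth ConfigMap, add a mapRoles entry for the node role instead. A sketch of the standard worker-node mapping (edit with kubectl edit configmap aws-auth -n kube-system):

# Appended under mapRoles in the aws-auth ConfigMap
- rolearn: arn:aws:iam::ACCOUNT_ID:role/KarpenterNodeRole-CLUSTER_NAME
  username: system:node:{{EC2PrivateDNSName}}
  groups:
    - system:bootstrappers
    - system:nodes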
Controller Role
The controller role gives Karpenter permission to launch and terminate EC2 instances, manage spot interruption queues, and describe pricing. Pod Identity is the recommended path in v1.11. Create the role and associate it:
aws iam create-role \
--role-name KarpenterControllerRole-CLUSTER_NAME \
--assume-role-policy-document '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "pods.eks.amazonaws.com"},
"Action": ["sts:AssumeRole", "sts:TagSession"]
}]
}'
Attach the Karpenter controller policy. The policy document is long, so create it from the official JSON published in the Karpenter docs rather than hand-writing it. A sketch of the commands, assuming you have saved that JSON locally as controller-policy.json:
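aws iam create-policy \
  --policy-name KarpenterControllerPolicy-CLUSTER_NAME \
  --policy-document file://controller-policy.json

aws iam attach-role-policy \
  --role-name KarpenterControllerRole-CLUSTER_NAME \
  --policy-arn arn:aws:iam::ACCOUNT_ID:policy/KarpenterControllerPolicy-CLUSTER_NAME

Then create the Pod Identity association: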
aws eks create-pod-identity-association \
--cluster-name CLUSTER_NAME \
--namespace kube-system \
--service-account karpenter \
--role-arn arn:aws:iam::ACCOUNT_ID:role/KarpenterControllerRole-CLUSTER_NAME
If your cluster uses IRSA instead, create an OIDC trust policy for the role. The IRSA guide walks through that process.
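For reference, an IRSA trust policy federates on the cluster's OIDC provider instead of pods.eks.amazonaws.com. A sketch, with OIDC_PROVIDER standing in for your provider URL minus the https:// prefix:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/OIDC_PROVIDER"},
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "OIDC_PROVIDER:sub": "system:serviceaccount:kube-system:karpenter",
        "OIDC_PROVIDER:aud": "sts.amazonaws.com"
      }
    }
  }]
}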
Install Karpenter with Helm
Karpenter v1.11.1 ships as an OCI Helm chart from the public ECR registry. No need to add a Helm repo first.
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
--version "1.11.1" \
--namespace kube-system \
--set "settings.clusterName=CLUSTER_NAME" \
--set "settings.interruptionQueueName=CLUSTER_NAME" \
--set "settings.clusterEndpoint=$(aws eks describe-cluster --name CLUSTER_NAME --query 'cluster.endpoint' --output text)" \
--wait
Verify the controller pod is running:
kubectl get pods -n kube-system -l app.kubernetes.io/name=karpenter
You should see the controller pod in Running state:
NAME READY STATUS RESTARTS AGE
karpenter-6f4b8d7c9f-x8k2p 1/1 Running 0 45s
Configure the NodePool
The NodePool CRD (API version karpenter.sh/v1) tells Karpenter what kind of nodes it can create. This is where you define instance families, capacity types, architecture, and consolidation behavior.
cat <<EOF > nodepool.yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: default
spec:
template:
spec:
requirements:
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: node.kubernetes.io/instance-type
operator: In
values:
- t3.medium
- t3.large
- t3a.medium
- t3a.large
- m5.large
- m5a.large
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: default
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
limits:
cpu: "100"
memory: 200Gi
EOF
A few things worth noting in this spec. The limits section caps total capacity at 100 vCPUs and 200 GiB memory, which prevents runaway scaling if something goes wrong. The consolidationPolicy: WhenEmptyOrUnderutilized with a 30-second delay means Karpenter actively consolidates, not just when nodes are completely empty but also when it can bin-pack pods onto fewer nodes.
By listing both spot and on-demand in capacity types, Karpenter will prefer Spot for cost savings and fall back to On-Demand when Spot capacity is unavailable.
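If a particular workload must never land on Spot, you do not have to change the NodePool; a nodeSelector on the pod template pins it to On-Demand capacity. A minimal sketch of the relevant Deployment fragment:

# Pod template fragment: schedule only onto On-Demand nodes
spec:
  template:
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand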
Configure the EC2NodeClass
The EC2NodeClass (API version karpenter.k8s.aws/v1) defines the AWS-specific settings: AMI, subnets, security groups, and the instance profile.
cat <<EOF > ec2nodeclass.yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: default
spec:
role: KarpenterNodeRole-CLUSTER_NAME
amiSelectorTerms:
- alias: al2023@latest
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: CLUSTER_NAME
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: CLUSTER_NAME
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 50Gi
volumeType: gp3
deleteOnTermination: true
EOF
The amiSelectorTerms with alias: al2023@latest tells Karpenter to always use the latest Amazon Linux 2023 EKS-optimized AMI. When AWS publishes a new AMI, Karpenter detects the drift and replaces nodes automatically (more on that later).
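If you would rather control AMI rollout yourself, the alias can be pinned to a specific release instead of tracking latest. The version string below is illustrative; check the AL2023 release notes for a real one:

# Pinned alias: Karpenter will not drift-replace nodes when a newer AMI ships
amiSelectorTerms:
  - alias: al2023@v20240807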
Apply both resources:
kubectl apply -f nodepool.yaml -f ec2nodeclass.yaml
Confirm they are created:
kubectl get nodepools,ec2nodeclasses
The output should show both resources with no errors in the status column:
NAME NODECLASS NODES READY AGE
nodepool.karpenter.sh/default default 0 True 10s
NAME READY AGE
ec2nodeclass.karpenter.k8s.aws/default True 10s
Test Scale-Up with an Inflate Deployment
The classic way to test Karpenter is to deploy pods that request enough resources to force new node provisioning. The pause container is perfect for this because it does nothing while still reserving the CPU and memory you request.
cat <<EOF > inflate.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: inflate
spec:
replicas: 0
selector:
matchLabels:
app: inflate
template:
metadata:
labels:
app: inflate
spec:
containers:
- name: inflate
image: public.ecr.aws/eks-distro/kubernetes/pause:3.7
resources:
requests:
cpu: "1"
memory: 1Gi
EOF
kubectl apply -f inflate.yaml
Scale it to 5 replicas. Each pod requests 1 CPU and 1Gi of memory, so the existing managed nodes won’t have room:
kubectl scale deployment inflate --replicas=5
Watch the Karpenter controller logs. Within seconds you will see nodeclaim registration:
kubectl logs -n kube-system -l app.kubernetes.io/name=karpenter --tail=20
The logs show the full provisioning lifecycle:
{"level":"INFO","msg":"registered nodeclaim","NodeClaim":"default-abc12","provider-id":"aws:///eu-west-1a/i-0a1b2c3d4e5f67890"}
{"level":"INFO","msg":"initialized nodeclaim","NodeClaim":"default-abc12","allocatable":{"cpu":"1930m","memory":"3388Mi"}}
{"level":"INFO","msg":"registered nodeclaim","NodeClaim":"default-def34","provider-id":"aws:///eu-west-1b/i-0b2c3d4e5f678901a"}
{"level":"INFO","msg":"initialized nodeclaim","NodeClaim":"default-def34","allocatable":{"cpu":"1930m","memory":"3388Mi"}}
{"level":"INFO","msg":"registered nodeclaim","NodeClaim":"default-ghi56","provider-id":"aws:///eu-west-1a/i-0c3d4e5f67890123b"}
{"level":"INFO","msg":"initialized nodeclaim","NodeClaim":"default-ghi56","allocatable":{"cpu":"1930m","memory":"3388Mi"}}
Three t3a.medium Spot instances came up in about 30 seconds. Karpenter chose t3a.medium over t3.medium because it’s slightly cheaper per vCPU, and it picked Spot because capacity was available. Verify the new nodes joined the cluster:
kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type
You should see the Karpenter-provisioned nodes alongside your managed node group:
NAME STATUS ROLES AGE VERSION CAPACITY-TYPE INSTANCE-TYPE
ip-10-0-1-50.eu-west-1.compute.internal Ready <none> 30s v1.33.8 spot t3a.medium
ip-10-0-1-51.eu-west-1.compute.internal Ready <none> 28s v1.33.8 spot t3a.medium
ip-10-0-1-52.eu-west-1.compute.internal Ready <none> 29s v1.33.8 spot t3a.medium
ip-10-0-2-10.eu-west-1.compute.internal Ready <none> 4h v1.33.8 on-demand t3.medium
Test Consolidation
Scale the inflate deployment back to zero and watch Karpenter reclaim the capacity:
kubectl scale deployment inflate --replicas=0
Within 90 seconds (30s consolidateAfter + node drain time), the logs show disruption in action:
{"level":"INFO","msg":"disrupting node(s) via delete, terminating 1 nodes (3 pods) ip-10-0-1-50.eu-west-1.compute.internal/t3a.medium/spot, savings: $0.04"}
{"level":"INFO","msg":"deleted node","Node":"ip-10-0-1-50.eu-west-1.compute.internal"}
{"level":"INFO","msg":"disrupting node(s) via delete, terminating 1 nodes (0 pods) ip-10-0-1-51.eu-west-1.compute.internal/t3a.medium/spot, savings: $0.04"}
{"level":"INFO","msg":"deleted node","Node":"ip-10-0-1-51.eu-west-1.compute.internal"}
{"level":"INFO","msg":"disrupting node(s) via delete, terminating 1 nodes (0 pods) ip-10-0-1-52.eu-west-1.compute.internal/t3a.medium/spot, savings: $0.04"}
{"level":"INFO","msg":"deleted node","Node":"ip-10-0-1-52.eu-west-1.compute.internal"}
All three Spot nodes were terminated. Karpenter even reports the per-node savings. On a cluster with dozens of underutilized nodes, this consolidation adds up fast.
Drift Detection
When you change the EC2NodeClass (new AMI alias, different security group, updated block device mapping) or when AWS publishes a new EKS-optimized AMI, Karpenter detects that existing nodes have “drifted” from the desired spec. It then gracefully cordons, drains, and replaces them.
This is one of Karpenter’s strongest operational advantages. With Cluster Autoscaler, you have to manually update launch templates and roll the node group. Karpenter handles it automatically, respecting Pod Disruption Budgets (PDBs) to avoid taking down too many pods at once.
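Drift is visible on the NodeClaim status conditions. A quick way to check (the nodeclaim name is illustrative, borrowed from the logs above):

kubectl get nodeclaims
kubectl describe nodeclaim default-abc12 | grep -i drifted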
Spot Instances and Interruption Handling
When you include spot in capacity types, Karpenter uses the price-capacity-optimized allocation strategy. This means AWS picks from pools that have both the lowest price and the highest available capacity, reducing the frequency of Spot interruptions compared to the older lowest-price strategy.
Karpenter also integrates with an SQS queue for Spot interruption notices. When AWS sends a 2-minute warning, Karpenter cordons and drains the affected node before the interruption hits. Set the queue name via settings.interruptionQueueName in the Helm values (we did this during installation).
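If you did not provision that queue through the official CloudFormation template, a minimal sketch of the queue itself looks like the following; note that the template additionally creates the EventBridge rules that route interruption events into the queue, which this sketch omits:

# Queue name must match settings.interruptionQueueName from the Helm install
aws sqs create-queue \
  --queue-name CLUSTER_NAME \
  --attributes '{"MessageRetentionPeriod":"300"}'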
Disruption Budgets
Consolidation and drift replacement are powerful, but you do not want Karpenter replacing all your nodes simultaneously during a traffic spike. Disruption budgets control how aggressively Karpenter can disrupt:
spec:
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 30s
budgets:
- nodes: "20%"
- nodes: "0"
schedule: "0 9 * * 1-5"
duration: 8h
This configuration allows Karpenter to disrupt up to 20% of nodes at any time, except during business hours (9 AM to 5 PM, Monday through Friday) when disruption is blocked entirely. Adjust these windows to match your traffic patterns.
Troubleshooting
Error: “panic: the Karpenter version is not supported on EKS version”
Karpenter v1.11.x requires EKS 1.28 or later. If your cluster is on an older version, the controller panics at startup. Upgrade your EKS control plane first, then install Karpenter.
Error: “AuthFailure: Not authorized to perform sts:AssumeRole”
The Pod Identity association is either missing or the role’s trust policy does not include pods.eks.amazonaws.com. Verify the association exists:
aws eks list-pod-identity-associations --cluster-name CLUSTER_NAME
If empty, recreate the association. If present, check the role’s trust policy allows the sts:AssumeRole and sts:TagSession actions from pods.eks.amazonaws.com.
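To inspect the trust policy directly:

aws iam get-role \
  --role-name KarpenterControllerRole-CLUSTER_NAME \
  --query 'Role.AssumeRolePolicyDocument'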
Error: “DNS timeout resolving eks.eu-west-1.amazonaws.com”
The Karpenter pod cannot reach the EKS API. This usually means CoreDNS is not running or the security group blocks outbound traffic. Check that CoreDNS pods are healthy and that the node security group allows outbound HTTPS (port 443) to the EKS API endpoint.
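A quick health check for CoreDNS:

kubectl get pods -n kube-system -l k8s-app=kube-dns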
Nodes stuck in NotReady: IP address exhaustion
Each pod on an EC2 instance consumes a secondary IP on an ENI. When the subnet runs out of IPs, the CNI cannot allocate addresses: new nodes can hang in NotReady and new pods sit in ContainerCreating indefinitely. Check available IPs in the subnet:
aws ec2 describe-subnets --subnet-ids subnet-XXXXXXXXXXXXXXXXX \
--query 'Subnets[0].AvailableIpAddressCount'
If the count is low, either use larger subnets (/20 or bigger), enable prefix delegation on the VPC CNI, or exclude instance types that reserve large warm pools of IPs per node.
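Enabling prefix delegation is a one-line change to the VPC CNI DaemonSet; it applies to nodes launched after the change, so existing nodes keep their per-IP behavior:

kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true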
Cleanup
Remove the test deployment and Karpenter resources in order:
kubectl delete deployment inflate
kubectl delete nodepool default
kubectl delete ec2nodeclass default
Wait for all Karpenter-managed nodes to terminate (check with kubectl get nodes), then uninstall the Helm release:
helm uninstall karpenter -n kube-system
Delete the IAM roles and instance profile if you no longer need them:
aws iam remove-role-from-instance-profile \
--instance-profile-name KarpenterNodeInstanceProfile-CLUSTER_NAME \
--role-name KarpenterNodeRole-CLUSTER_NAME
aws iam delete-instance-profile --instance-profile-name KarpenterNodeInstanceProfile-CLUSTER_NAME
aws iam detach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
aws iam detach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
aws iam detach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
aws iam detach-role-policy --role-name KarpenterNodeRole-CLUSTER_NAME \
--policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam delete-role --role-name KarpenterNodeRole-CLUSTER_NAME
aws iam delete-role --role-name KarpenterControllerRole-CLUSTER_NAME
FAQ
Can Karpenter and Cluster Autoscaler run on the same cluster?
Yes, but they should manage different node groups. Karpenter manages nodes it provisions (via NodePool), while Cluster Autoscaler manages ASG-backed managed node groups. They will not conflict as long as you do not point both at the same group of nodes.
Does Karpenter work with Fargate?
No. Karpenter provisions EC2 instances. Fargate profiles are a separate scheduling mechanism managed by AWS. You can use both on the same cluster, but they serve different workloads.
How does Karpenter choose between Spot and On-Demand?
When both capacity types are allowed in the NodePool, Karpenter prefers Spot because it is cheaper. If the Spot fleet API returns insufficient capacity for the requested instance types, Karpenter falls back to On-Demand automatically. You can also force On-Demand only by removing spot from the requirements.
What happens to pods during consolidation?
Karpenter cordons the node, then drains it by evicting pods. Pods with PodDisruptionBudgets are respected. If a PDB would be violated, Karpenter skips that node until the budget allows disruption. The replacement pods are scheduled on remaining nodes or trigger new nodes if needed.
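A minimal PDB for the inflate test deployment would look like this (illustrative only; the test workload has no real availability requirement):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inflate
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: inflate

With this in place, Karpenter will not consolidate a node if evicting its inflate pods would drop the ready count below 3.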
How do I restrict Karpenter to specific Availability Zones?
Add a topology requirement to the NodePool spec:
- key: topology.kubernetes.io/zone
operator: In
values: ["eu-west-1a", "eu-west-1b"]
Karpenter will only launch nodes in those zones. This is useful for workloads that depend on EBS volumes in specific AZs.