A single control plane node is a single point of failure. When it goes down, you lose the API server, scheduler, and controller manager all at once. Workloads keep running on the workers, but you can’t deploy, scale, or recover anything until that one node comes back. Production Kubernetes needs at least three control plane nodes.
This guide builds a full kubeadm high availability cluster from scratch: three control plane nodes with stacked etcd, two workers, an HAProxy plus keepalived load balancer in front, and Calico for pod networking. We’re using Kubernetes 1.35.3 on Ubuntu 24.04. Every command here was tested on real VMs, and the gotchas (especially around Calico’s rpfilter and etcd health checks) are documented because they cost real time to debug. If you need a lighter setup first, check our K3s quickstart guide.
Tested April 2026 | Ubuntu 24.04.4 LTS, Kubernetes 1.35.3, Calico 3.29.3, containerd 2.2.2, HAProxy 2.8.16
Architecture Overview
We’re using the stacked etcd topology, where each control plane node runs its own etcd member alongside the API server, scheduler, and controller manager. This is simpler to set up than external etcd (fewer machines, fewer certificates) and is the topology kubeadm supports natively with --upload-certs. The tradeoff is that losing a control plane node also loses an etcd member, which is why three nodes is the minimum for quorum.
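The quorum arithmetic behind that minimum is worth seeing once. A cluster of n members needs floor(n/2)+1 votes, so even member counts buy no extra fault tolerance:

```shell
# Quorum for an n-member etcd cluster is floor(n/2) + 1.
# Failures the cluster can tolerate is n minus quorum.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  tolerated=$(( n - quorum ))
  echo "members=$n quorum=$quorum tolerated_failures=$tolerated"
done
```

Note that 2 members tolerate zero failures and 4 tolerate only one, which is why control plane counts are always odd.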
A virtual IP (VIP) managed by keepalived floats between HAProxy instances. All kubectl traffic and kubelet registration goes through the VIP, so clients always reach a live API server without caring which control plane node answers. You can also use pfSense as a load balancer for the API server if you already have one in your network. In this guide we run HAProxy and keepalived on a dedicated node, but in production you might colocate them on the control plane nodes themselves.
Node Inventory
| Role | Hostname | IP Address | Specs |
|---|---|---|---|
| Load Balancer | lb01 | 10.0.1.50 | 2 vCPU, 2 GB RAM, 20 GB disk |
| Control Plane 1 | cp01 | 10.0.1.51 | 4 vCPU, 4 GB RAM, 50 GB disk |
| Control Plane 2 | cp02 | 10.0.1.52 | 4 vCPU, 4 GB RAM, 50 GB disk |
| Control Plane 3 | cp03 | 10.0.1.53 | 4 vCPU, 4 GB RAM, 50 GB disk |
| Worker 1 | worker01 | 10.0.1.54 | 4 vCPU, 8 GB RAM, 100 GB disk |
| Worker 2 | worker02 | 10.0.1.55 | 4 vCPU, 8 GB RAM, 100 GB disk |
The virtual IP (VIP) for the API server endpoint is 10.0.1.60.
Port Requirements
These ports must be open between the nodes. Control plane nodes need more ports than workers because they run etcd and the API server.
| Component | Port(s) | Protocol | Used By |
|---|---|---|---|
| API Server | 6443 | TCP | All nodes, kubectl |
| etcd | 2379-2380 | TCP | Control plane nodes only |
| Kubelet API | 10250 | TCP | All nodes |
| Scheduler | 10259 | TCP | Control plane nodes |
| Controller Manager | 10257 | TCP | Control plane nodes |
| NodePort Services | 30000-32767 | TCP | Worker nodes |
| Calico BGP | 179 | TCP | All nodes |
| Calico VXLAN | 4789 | UDP | All nodes |
| Calico Typha | 5473 | TCP | All nodes |
Prerequisites
- 6 servers running Ubuntu 24.04 LTS (or Rocky Linux 10 with adjustments noted in the OS comparison table below)
- Static IP addresses configured on each node
- Root or sudo access on all nodes
- Tested on: Ubuntu 24.04.4 LTS (kernel 6.8.0-101-generic), Kubernetes 1.35.3, containerd 2.2.2
- Network connectivity between all nodes on the required ports
- A reserved IP for the VIP that is not assigned to any node
Prepare All Nodes
Run every command in this section on all six nodes unless noted otherwise. This covers hostname resolution, kernel module loading, container runtime installation, and the Kubernetes packages.
Set Hostnames and /etc/hosts
Each node needs a unique hostname and the ability to resolve every other node by name. Set the hostname first (replace with the appropriate name for each node):
sudo hostnamectl set-hostname cp01
Then populate /etc/hosts on every node:
sudo vi /etc/hosts
Add these entries below the existing localhost lines:
10.0.1.50 lb01
10.0.1.51 cp01
10.0.1.52 cp02
10.0.1.53 cp03
10.0.1.54 worker01
10.0.1.55 worker02
10.0.1.60 k8s-api
The k8s-api entry points to the VIP. All kubeadm and kubectl traffic will use this name.
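Before copying that block onto six machines, a quick check for duplicate IPs or hostnames can save a confusing debugging session later. This is a sketch with the inventory inlined; you could feed it the real file instead:

```shell
# Flag any IP or hostname that appears more than once in the inventory.
inventory='10.0.1.50 lb01
10.0.1.51 cp01
10.0.1.52 cp02
10.0.1.53 cp03
10.0.1.54 worker01
10.0.1.55 worker02
10.0.1.60 k8s-api'
dups=$(printf '%s\n' "$inventory" | awk '{print $1; print $2}' | sort | uniq -d)
if [ -z "$dups" ]; then
  echo "inventory OK"
else
  echo "duplicates found: $dups"
fi
```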
Disable Swap
Kubelet refuses to start if swap is active. Disable it immediately and remove the swap entry from fstab so it stays off after reboot:
sudo swapoff -a
sudo sed -i '/\sswap\s/s/^/#/' /etc/fstab
Confirm swap is off:
free -h | grep Swap
The Swap line should show all zeros:
Swap: 0B 0B 0B
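If you want to see exactly what the sed expression does before touching the real fstab, run it against a throwaway copy. The swap line below is illustrative:

```shell
# Sample fstab with a swap entry; the sed comments out any line that
# contains a whitespace-delimited "swap" field.
cat > /tmp/fstab.sample <<'EOF'
UUID=abcd-1234 / ext4 defaults 0 1
/swap.img none swap sw 0 0
EOF
sed -i '/\sswap\s/s/^/#/' /tmp/fstab.sample
cat /tmp/fstab.sample
```

Only the swap line gains the `#` prefix; the root filesystem entry is untouched.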
Load Kernel Modules
Kubernetes networking requires the overlay and br_netfilter modules. Load them now and make them persistent:
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF
sudo modprobe overlay
sudo modprobe br_netfilter
Set Sysctl Parameters
Bridge traffic needs to pass through iptables for Kubernetes network policies to work. IP forwarding is required for pod-to-pod communication across nodes:
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system
Verify the values are applied:
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward
All three should return = 1:
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
Install containerd
We install containerd from Docker’s official repository because the Ubuntu-packaged version often lags behind. Start by adding Docker’s GPG key and repo:
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y containerd.io
Check the installed version:
containerd --version
The output should confirm a containerd 2.x release:
containerd github.com/containerd/containerd/v2 v2.2.2 2a13d8c2b3c7cd9a9facda40a679f64bd4150236
Configure containerd for systemd Cgroup
Kubernetes expects the container runtime to use the systemd cgroup driver. Generate the default containerd config, then modify it:
sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml > /dev/null
Edit the config file:
sudo vi /etc/containerd/config.toml
Find the [plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runc.options] section (that is the name containerd 2.x generates; containerd 1.x names it [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]) and set SystemdCgroup to true:
SystemdCgroup = true
Alternatively, use sed to make the change in one shot:
sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml
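The sed is broad: it flips every SystemdCgroup = false in the file. You can preview the effect on a minimal sample before running it against the real config:

```shell
# Minimal stand-in for /etc/containerd/config.toml to preview the edit.
# The section name follows containerd 2.x; containerd 1.x names it
# [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options].
cat > /tmp/containerd-sample.toml <<'EOF'
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.runc.options]
  SystemdCgroup = false
EOF
sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /tmp/containerd-sample.toml
grep SystemdCgroup /tmp/containerd-sample.toml
```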
Restart containerd to pick up the change:
sudo systemctl restart containerd
sudo systemctl enable containerd
Install kubeadm, kubelet, and kubectl
Add the Kubernetes v1.35 package repository. The new pkgs.k8s.io repository uses version-specific paths, so you get exactly the minor version you want:
sudo apt-get install -y apt-transport-https
curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
The apt-mark hold prevents these packages from being accidentally upgraded, which could break the cluster. Verify the installed versions:
kubeadm version -o short
Output confirms v1.35.3:
v1.35.3
Configure UFW Firewall
On control plane nodes (cp01, cp02, cp03), open these ports:
sudo ufw allow 6443/tcp
sudo ufw allow 2379:2380/tcp
sudo ufw allow 10250/tcp
sudo ufw allow 10259/tcp
sudo ufw allow 10257/tcp
sudo ufw allow 179/tcp
sudo ufw allow 4789/udp
sudo ufw allow 5473/tcp
sudo ufw reload
On worker nodes (worker01, worker02), you need fewer ports:
sudo ufw allow 10250/tcp
sudo ufw allow 30000:32767/tcp
sudo ufw allow 179/tcp
sudo ufw allow 4789/udp
sudo ufw allow 5473/tcp
sudo ufw reload
On the load balancer (lb01), only port 6443 needs to be open:
sudo ufw allow 6443/tcp
sudo ufw reload
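The per-role rules above can also be driven from a single port list, which keeps the three node types from drifting apart as ports change. This sketch only echoes the commands as a dry run; drop the echo to apply them for real:

```shell
# Control plane port list, taken from the port requirements table.
cp_ports_tcp="6443 2379:2380 10250 10259 10257 179 5473"
cp_ports_udp="4789"
for p in $cp_ports_tcp; do echo "ufw allow $p/tcp"; done
for p in $cp_ports_udp; do echo "ufw allow $p/udp"; done
```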
Set Up the Load Balancer
Run these steps on lb01 only. HAProxy distributes API server traffic across the three control plane nodes, and keepalived manages the VIP so the load balancer itself isn’t a single point of failure. (In a full production setup, you’d run keepalived on two LB nodes. For this guide, one LB with a VIP is sufficient to demonstrate the pattern.)
Install HAProxy and Keepalived
sudo apt-get update
sudo apt-get install -y haproxy keepalived
Configure HAProxy
Open the HAProxy configuration file:
sudo vi /etc/haproxy/haproxy.cfg
Replace the contents with this configuration, which load balances TCP traffic on port 6443 across the three control plane nodes:
global
    log /dev/log local0
    log /dev/log local1 notice
    daemon

defaults
    log global
    mode tcp
    option tcplog
    option dontlognull
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms

frontend k8s-api
    bind *:6443
    default_backend k8s-api-backend

backend k8s-api-backend
    option tcp-check
    balance roundrobin
    server cp01 10.0.1.51:6443 check fall 3 rise 2
    server cp02 10.0.1.52:6443 check fall 3 rise 2
    server cp03 10.0.1.53:6443 check fall 3 rise 2
Validate the configuration syntax:
sudo haproxy -c -f /etc/haproxy/haproxy.cfg
You should see:
Configuration file is valid
Configure Keepalived
Open the keepalived configuration:
sudo vi /etc/keepalived/keepalived.conf
Add this configuration. Replace eth0 with your actual network interface name (check with ip a):
vrrp_script check_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass k8s-vip-secret
    }
    virtual_ipaddress {
        10.0.1.60/24
    }
    track_script {
        check_haproxy
    }
}
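If you are unsure of the interface name, one way to pull it from the routing table is to parse the default route. The sample string below stands in for the output of ip -o -4 route show to default so the parsing is visible:

```shell
# Extract the interface name that follows the "dev" keyword in a
# default-route line (the sample mimics: ip -o -4 route show to default).
sample='default via 10.0.1.1 dev eth0 proto static metric 100'
echo "$sample" | awk '{for (i = 1; i < NF; i++) if ($i == "dev") print $(i + 1)}'
```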
Start and enable both services:
sudo systemctl enable --now haproxy
sudo systemctl enable --now keepalived
Verify the VIP is active on the load balancer:
ip addr show eth0 | grep 10.0.1.60
The output confirms the VIP is assigned:
inet 10.0.1.60/24 scope global secondary eth0
At this point, HAProxy will report all backends as DOWN because no API server is running yet. That’s expected.
Bootstrap the First Control Plane Node
Run these commands on cp01 only. This is the initial control plane node that seeds the cluster, uploads certificates, and generates the join tokens for the other nodes.
sudo kubeadm init \
--control-plane-endpoint "k8s-api:6443" \
--upload-certs \
--pod-network-cidr=10.244.0.0/16
The --control-plane-endpoint flag tells every node to contact the API server through the VIP hostname. The --upload-certs flag encrypts and uploads control plane certificates to a kubeadm-certs Secret so that joining control plane nodes can pull them automatically. Without it, you’d have to manually copy certificates between nodes.
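The same init can be expressed as a kubeadm configuration file, which is easier to keep in version control and reuse across rebuilds. This sketch assumes the kubeadm.k8s.io/v1beta4 config API; run it through sudo kubeadm init --config kubeadm-config.yaml --upload-certs instead of the flag form:

```shell
# Writes a ClusterConfiguration equivalent to the flag-based init above.
cat > kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.35.3
controlPlaneEndpoint: "k8s-api:6443"
networking:
  podSubnet: 10.244.0.0/16
EOF
echo "wrote kubeadm-config.yaml"
```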
When the init completes, kubeadm prints two join commands: one for additional control plane nodes and one for workers. Save both of these. The certificate key expires after 2 hours. The output looks like this:
Your Kubernetes control-plane has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You can now join any number of control-plane nodes by running the following command on each:
kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:abcdef0123456789... \
--control-plane --certificate-key abcdef0123456789...
Then you can join any number of worker nodes by running the following on each as root:
kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:abcdef0123456789...
Set up kubectl access on cp01. (For a kubectl reference, see our kubectl and kubectx cheat sheet.)
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Quick check that the API server responds through the VIP:
kubectl cluster-info
The output should reference the VIP endpoint:
Kubernetes control plane is running at https://k8s-api:6443
CoreDNS is running at https://k8s-api:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
The node will show as NotReady until a CNI plugin is installed. That’s the next step.
Install Calico CNI
Calico provides both pod networking and network policy enforcement. We’re using the Tigera operator method, which manages Calico as a set of custom resources rather than a raw manifest. This makes upgrades cleaner and configuration more declarative.
Apply the Tigera operator:
kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.29.3/manifests/tigera-operator.yaml
Create the Installation custom resource. The CIDR must match the --pod-network-cidr you passed to kubeadm init:
cat <<EOF | kubectl apply -f -
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
EOF
Watch the Calico pods come up:
watch kubectl get pods -n calico-system
Wait until all pods show Running. This typically takes 2 to 3 minutes:
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-6b4b6f4d5c-x9kzn 1/1 Running 0 2m15s
calico-node-bq7kl 1/1 Running 0 2m15s
calico-typha-7f4d6f5b5c-jk2md 1/1 Running 0 2m15s
csi-node-driver-lznp8 2/2 Running 0 2m15s
Configure Failsafe Inbound Host Ports
This step is critical and catches most people off guard. Calico’s default rpfilter (reverse path filter) settings can block inter-node etcd traffic, which causes etcd health checks to fail when you try to join additional control plane nodes. Configure failsafe ports before joining cp02 and cp03:
kubectl patch felixconfiguration default --type=merge -p '{"spec":{"failsafeInboundHostPorts":[{"protocol":"tcp","port":22},{"protocol":"tcp","port":6443},{"protocol":"tcp","port":2379},{"protocol":"tcp","port":2380},{"protocol":"tcp","port":10250},{"protocol":"tcp","port":10259},{"protocol":"tcp","port":10257},{"protocol":"tcp","port":179},{"protocol":"udp","port":4789},{"protocol":"tcp","port":5473}]}}'
Verify the configuration was applied:
kubectl get felixconfiguration default -o jsonpath='{.spec.failsafeInboundHostPorts}' | python3 -m json.tool
You should see the full list of failsafe ports in the output. If you skip this step, the cp02 and cp03 join will hang during the etcd health check. The alternative workaround is to pass --skip-phases=check-etcd during join, but configuring the failsafe ports properly is the correct fix.
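The one-line patch is hard to review. The same change can be kept as a patch file in version control and applied with kubectl patch felixconfiguration default --type=merge --patch-file failsafe-ports.yaml (the filename is arbitrary):

```shell
# Writes the failsafe port list as a reviewable merge-patch file.
cat > failsafe-ports.yaml <<'EOF'
spec:
  failsafeInboundHostPorts:
    - {protocol: tcp, port: 22}
    - {protocol: tcp, port: 6443}
    - {protocol: tcp, port: 2379}
    - {protocol: tcp, port: 2380}
    - {protocol: tcp, port: 10250}
    - {protocol: tcp, port: 10259}
    - {protocol: tcp, port: 10257}
    - {protocol: tcp, port: 179}
    - {protocol: udp, port: 4789}
    - {protocol: tcp, port: 5473}
EOF
echo "wrote failsafe-ports.yaml"
```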
Cilium as an alternative: If you prefer eBPF-based networking, Cilium is a solid choice that avoids the rpfilter issue entirely. It’s more complex to configure for HA clusters but provides better observability through Hubble. For this guide, we stick with Calico because it’s the most battle-tested CNI for kubeadm HA setups.
Join Additional Control Plane Nodes
Run the control plane join command (from the kubeadm init output) on cp02 and cp03. The command includes the --control-plane flag and the certificate key:
sudo kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:abcdef0123456789... \
--control-plane --certificate-key abcdef0123456789...
Replace the token, hash, and certificate key with your actual values from the init output. If the certificate key has expired (older than 2 hours), regenerate it on cp01:
sudo kubeadm init phase upload-certs --upload-certs
If the bootstrap token itself has expired (tokens last 24 hours by default), mint a new one together with a ready-made worker join command, then append --control-plane and the certificate key for control plane joins:
sudo kubeadm token create --print-join-command
If the join hangs during the etcd health check despite configuring failsafe ports, you can bypass it with:
sudo kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:abcdef0123456789... \
--control-plane --certificate-key abcdef0123456789... \
--skip-phases=check-etcd
Use this only as a last resort. The failsafe port configuration from the previous section should resolve it. After both nodes join, set up kubectl access on each one:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
Back on cp01, verify all three control plane nodes are Ready:
kubectl get nodes
The output should show three control plane nodes:
NAME STATUS ROLES AGE VERSION
cp01 Ready control-plane 12m v1.35.3
cp02 Ready control-plane 4m31s v1.35.3
cp03 Ready control-plane 2m18s v1.35.3
Join Worker Nodes
On worker01 and worker02, run the worker join command from the kubeadm init output. This is the simpler join without the --control-plane flag:
sudo kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
--discovery-token-ca-cert-hash sha256:abcdef0123456789...
Workers join quickly since they don’t need to sync etcd or download control plane certificates. Back on cp01, confirm all five nodes are part of the cluster:
kubectl get nodes -o wide
All five nodes should show Ready:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cp01 Ready control-plane 18m v1.35.3 10.0.1.51 <none> Ubuntu 24.04.4 LTS 6.8.0-101-generic containerd://2.2.2
cp02 Ready control-plane 10m v1.35.3 10.0.1.52 <none> Ubuntu 24.04.4 LTS 6.8.0-101-generic containerd://2.2.2
cp03 Ready control-plane 8m v1.35.3 10.0.1.53 <none> Ubuntu 24.04.4 LTS 6.8.0-101-generic containerd://2.2.2
worker01 Ready <none> 3m12s v1.35.3 10.0.1.54 <none> Ubuntu 24.04.4 LTS 6.8.0-101-generic containerd://2.2.2
worker02 Ready <none> 2m45s v1.35.3 10.0.1.55 <none> Ubuntu 24.04.4 LTS 6.8.0-101-generic containerd://2.2.2
Verify the HA Cluster
A cluster with all nodes Ready doesn’t prove HA is working. You need to verify etcd quorum, test failover, and confirm workloads schedule correctly across workers.
Check etcd Cluster Health
List the etcd members from cp01:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
member list -w table
All three members should show as started:
+------------------+---------+------+-------------------------+-------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+------+-------------------------+-------------------------+------------+
| 2a1b3c4d5e6f7a8b | started | cp01 | https://10.0.1.51:2380 | https://10.0.1.51:2379 | false |
| 3b2c4d5e6f7a8b9c | started | cp02 | https://10.0.1.52:2380 | https://10.0.1.52:2379 | false |
| 4c3d5e6f7a8b9c0d | started | cp03 | https://10.0.1.53:2380 | https://10.0.1.53:2379 | false |
+------------------+---------+------+-------------------------+-------------------------+------------+
Check endpoint health across all three:
sudo ETCDCTL_API=3 etcdctl \
--endpoints=https://10.0.1.51:2379,https://10.0.1.52:2379,https://10.0.1.53:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
endpoint health -w table
All endpoints should report as healthy:
+-------------------------+--------+-------------+-------+
| ENDPOINT | HEALTH | TOOK | ERROR |
+-------------------------+--------+-------------+-------+
| https://10.0.1.51:2379 | true | 12.456ms | |
| https://10.0.1.52:2379 | true | 14.231ms | |
| https://10.0.1.53:2379 | true | 13.892ms | |
+-------------------------+--------+-------------+-------+
Test a Workload Deployment
Deploy a simple nginx workload with 3 replicas to confirm scheduling works across the worker nodes:
kubectl create deployment nginx-test --image=nginx:latest --replicas=3
Check where the pods landed:
kubectl get pods -o wide -l app=nginx-test
Pods should be distributed across the worker nodes:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-test-7c5b8d6f4-2kxnm 1/1 Running 0 28s 10.244.1.12 worker01 <none> <none>
nginx-test-7c5b8d6f4-8mz4p 1/1 Running 0 28s 10.244.2.8 worker02 <none> <none>
nginx-test-7c5b8d6f4-j3f7s 1/1 Running 0 28s 10.244.1.13 worker01 <none> <none>
Clean up the test deployment:
kubectl delete deployment nginx-test
Rocky Linux 10 vs Ubuntu 24.04 Comparison
If you’re running Rocky Linux 10 (or AlmaLinux 10, RHEL 10) instead of Ubuntu, several commands and paths differ. This table covers the key differences so you can adapt the guide to your distribution.
| Item | Rocky Linux 10 / RHEL 10 | Ubuntu 24.04 |
|---|---|---|
| Package manager | dnf install | apt-get install |
| containerd install | Docker CE repo for RHEL: dnf config-manager --add-repo | Docker CE repo for Ubuntu: apt sources |
| K8s repo format | pkgs.k8s.io/core:/stable:/v1.35/rpm/ | pkgs.k8s.io/core:/stable:/v1.35/deb/ |
| Hold packages | dnf versionlock add kubelet kubeadm kubectl | apt-mark hold kubelet kubeadm kubectl |
| Firewall tool | firewall-cmd --permanent --add-port=6443/tcp | ufw allow 6443/tcp |
| SELinux / AppArmor | SELinux enforcing. Requires setsebool -P container_manage_cgroup on | AppArmor active. No extra config needed for K8s |
| Service names | Same: kubelet, containerd, haproxy | Same: kubelet, containerd, haproxy |
| Config paths | /etc/containerd/config.toml (same) | /etc/containerd/config.toml (same) |
| Kernel modules | Same approach: /etc/modules-load.d/ | Same approach: /etc/modules-load.d/ |
On Rocky Linux 10 with SELinux enforcing, you also need to allow the kubelet to manage container cgroups. Without this, pod creation fails with permission denied errors.
Production Hardening
A working HA cluster is just the foundation. Before running real workloads, address these areas.
Certificate Rotation
Kubeadm certificates expire after one year by default. The kubelet handles its own certificate rotation automatically, but the control plane certificates (API server, etcd, front-proxy) need manual renewal. Check expiration dates:
sudo kubeadm certs check-expiration
Renew all certificates before they expire:
sudo kubeadm certs renew all
After renewal, restart the control plane static pods by moving the manifests out and back. The pause must outlast the kubelet's manifest scan interval (20 seconds by default), or the kubelet may never notice the files were gone:
sudo mv /etc/kubernetes/manifests/*.yaml /tmp/
sleep 30
sudo mv /tmp/kube-*.yaml /tmp/etcd.yaml /etc/kubernetes/manifests/
Run the renewal on each control plane node. Set a calendar reminder for 11 months out.
etcd Backups
etcd holds the entire cluster state. If all three etcd members are lost without a backup, the cluster is gone. Schedule automated snapshots on one control plane node:
sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot-$(date +%Y%m%d).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
Verify the snapshot is valid:
sudo ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot-$(date +%Y%m%d).db -w table
Copy snapshots to off-cluster storage. An etcd backup on the same machine it’s backing up is not really a backup.
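Snapshots also pile up if nothing prunes them. A retention pass can be as simple as a find with a mtime filter; this sketch uses a temp directory and an artificially aged file so the effect is visible, but in real use you would point it at /var/backups from a cron job:

```shell
# Keep only the last 7 days of snapshots. The "old" file is backdated
# with touch -d (GNU coreutils) to simulate a stale snapshot.
mkdir -p /tmp/etcd-backups
touch -d '10 days ago' /tmp/etcd-backups/etcd-snapshot-old.db
touch /tmp/etcd-backups/etcd-snapshot-new.db
find /tmp/etcd-backups -name 'etcd-snapshot-*.db' -mtime +7 -delete
ls /tmp/etcd-backups
```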
Network Policies
Calico supports Kubernetes NetworkPolicy out of the box. By default, all pod-to-pod traffic is allowed. At minimum, create a default-deny ingress policy for each namespace, then explicitly allow the traffic patterns your applications need. This prevents lateral movement if a pod is compromised. Once your cluster is stable, set up Prometheus and Grafana to monitor cluster health and resource usage.
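A minimal default-deny ingress policy looks like this (the namespace name is illustrative). It is written to a file here so you can review it before running kubectl apply -f default-deny.yaml:

```shell
# Default-deny ingress for one namespace: an empty podSelector matches
# every pod, and listing Ingress with no rules denies all inbound traffic.
cat > default-deny.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF
echo "wrote default-deny.yaml"
```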
Pod Security Standards
Kubernetes 1.35 enforces Pod Security Standards via the built-in admission controller. Label your namespaces to apply the appropriate level:
kubectl label namespace production pod-security.kubernetes.io/enforce=restricted
kubectl label namespace production pod-security.kubernetes.io/warn=restricted
The restricted level prevents running as root, disallows privilege escalation, and requires dropping all Linux capabilities. Start with baseline if restricted breaks your existing workloads, then work toward restricted over time.
Troubleshooting
Calico rpfilter Blocks Inter-Node Traffic
After installing Calico, you might notice that nodes can’t communicate on etcd or kubelet ports despite UFW rules being correct. Calico’s Felix component applies strict reverse path filtering by default. The fix is the FelixConfiguration patch shown earlier in this guide, which adds the Kubernetes ports to the failsafe list. Apply it before joining additional control plane nodes. If you’ve already joined them and they’re stuck, apply the patch and then restart kubelet on the affected nodes.
HAProxy Shows All Backends DOWN During Init
When you first start HAProxy, all three backend servers show as DOWN because no API server exists yet. This is normal. After the first control plane node bootstraps and the API server starts listening on port 6443, HAProxy detects it within a few health check intervals. If a backend stays DOWN after init completes, check that the control plane node’s firewall allows TCP 6443 from the load balancer IP.
etcd Health Check Fails During Control Plane Join
When joining cp02 or cp03, you might see the join process hang at “Running pre-flight checks” with etcd health check timeouts. This happens when Calico’s network policy blocks the joining node from reaching the existing etcd endpoint on port 2379. Two options: apply the failsafe inbound host ports patch (preferred), or pass --skip-phases=check-etcd to the join command as a workaround. The check-etcd phase is a pre-flight safety check, not a functional requirement. Skipping it lets the join proceed, and etcd will still form a healthy cluster.