Set Up a Highly Available Kubernetes Cluster with kubeadm

A single control plane node is a single point of failure. When it goes down, you lose the API server, scheduler, and controller manager all at once. Workloads keep running on the workers, but you can’t deploy, scale, or recover anything until that one node comes back. Production Kubernetes needs at least three control plane nodes.

This guide builds a full kubeadm high-availability cluster from scratch: three control plane nodes with stacked etcd, two workers, an HAProxy-plus-keepalived load balancer in front, and Calico for pod networking. We’re using Kubernetes 1.35.3 on Ubuntu 24.04. Every command here was tested on real VMs, and the gotchas (especially around Calico’s rpfilter and etcd health checks) are documented because they cost real time to debug. If you need a lighter setup first, check our K3s quickstart guide.

Tested April 2026 | Ubuntu 24.04.4 LTS, Kubernetes 1.35.3, Calico 3.29.3, containerd 2.2.2, HAProxy 2.8.16

Architecture Overview

We’re using the stacked etcd topology, where each control plane node runs its own etcd member alongside the API server, scheduler, and controller manager. This is simpler to set up than external etcd (fewer machines, fewer certificates) and is the topology kubeadm supports natively with --upload-certs. The tradeoff is that losing a control plane node also loses an etcd member, which is why three nodes is the minimum for quorum.
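The quorum arithmetic is worth internalizing: etcd needs a strict majority of members, floor(n/2) + 1, to accept writes, and the failure tolerance is whatever is left over. A quick sketch:

```shell
#!/usr/bin/env bash
# etcd quorum size and failure tolerance for a cluster of n members.
quorum()    { echo $(( $1 / 2 + 1 )); }        # members needed to accept writes
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); } # members you can lose

for n in 1 3 5; do
  echo "members=$n quorum=$(quorum $n) can_lose=$(tolerance $n)"
done
# members=1 quorum=1 can_lose=0
# members=3 quorum=2 can_lose=1
# members=5 quorum=3 can_lose=2
```

This is also why two control plane nodes are no better than one: quorum for n=2 is still 2, so losing either member halts writes.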

A virtual IP (VIP) managed by keepalived floats between HAProxy instances. All kubectl traffic and kubelet registration goes through the VIP, so no single load balancer is a bottleneck. You can also use pfSense as a load balancer for the API server if you already have one in your network. In this guide we run HAProxy and keepalived on a dedicated node, but in production you might colocate them on the control plane nodes themselves.

Node Inventory

| Role            | Hostname | IP Address | Specs                          |
|-----------------|----------|------------|--------------------------------|
| Load Balancer   | lb01     | 10.0.1.50  | 2 vCPU, 2 GB RAM, 20 GB disk   |
| Control Plane 1 | cp01     | 10.0.1.51  | 4 vCPU, 4 GB RAM, 50 GB disk   |
| Control Plane 2 | cp02     | 10.0.1.52  | 4 vCPU, 4 GB RAM, 50 GB disk   |
| Control Plane 3 | cp03     | 10.0.1.53  | 4 vCPU, 4 GB RAM, 50 GB disk   |
| Worker 1        | worker01 | 10.0.1.54  | 4 vCPU, 8 GB RAM, 100 GB disk  |
| Worker 2        | worker02 | 10.0.1.55  | 4 vCPU, 8 GB RAM, 100 GB disk  |

The virtual IP (VIP) for the API server endpoint is 10.0.1.60.

Port Requirements

These ports must be open between the nodes. Control plane nodes need more ports than workers because they run etcd and the API server.

| Component          | Port(s)     | Protocol | Used By                  |
|--------------------|-------------|----------|--------------------------|
| API Server         | 6443        | TCP      | All nodes, kubectl       |
| etcd               | 2379-2380   | TCP      | Control plane nodes only |
| Kubelet API        | 10250       | TCP      | All nodes                |
| Scheduler          | 10259       | TCP      | Control plane nodes      |
| Controller Manager | 10257       | TCP      | Control plane nodes      |
| NodePort Services  | 30000-32767 | TCP      | Worker nodes             |
| Calico BGP         | 179         | TCP      | All nodes                |
| Calico VXLAN       | 4789        | UDP      | All nodes                |
| Calico Typha       | 5473        | TCP      | All nodes                |
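Before bootstrapping, you can spot-check reachability between nodes with a small bash helper. This sketch uses bash’s built-in /dev/tcp, so nothing extra needs to be installed; before any service is running, everything except SSH will naturally report closed:

```shell
#!/usr/bin/env bash
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
check_port() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example sweep of key control plane ports from any node:
#   for host in 10.0.1.51 10.0.1.52 10.0.1.53; do
#     for port in 6443 2379 10250; do
#       check_port "$host" "$port" && echo "$host:$port open" || echo "$host:$port closed"
#     done
#   done
```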

Prerequisites

  • 6 servers running Ubuntu 24.04 LTS (or Rocky Linux 10 with adjustments noted in the OS comparison table below)
  • Static IP addresses configured on each node
  • Root or sudo access on all nodes
  • Tested on: Ubuntu 24.04.4 LTS (kernel 6.8.0-101-generic), Kubernetes 1.35.3, containerd 2.2.2
  • Network connectivity between all nodes on the required ports
  • A reserved IP for the VIP that is not assigned to any node

Prepare All Nodes

Run every command in this section on all six nodes unless noted otherwise. This covers hostname resolution, kernel module loading, container runtime installation, and the Kubernetes packages.

Set Hostnames and /etc/hosts

Each node needs a unique hostname and the ability to resolve every other node by name. Set the hostname first (replace with the appropriate name for each node):

sudo hostnamectl set-hostname cp01

Then populate /etc/hosts on every node:

sudo vi /etc/hosts

Add these entries below the existing localhost lines:

10.0.1.50 lb01
10.0.1.51 cp01
10.0.1.52 cp02
10.0.1.53 cp03
10.0.1.54 worker01
10.0.1.55 worker02
10.0.1.60 k8s-api

The k8s-api entry points to the VIP. All kubeadm and kubectl traffic will use this name.
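If you manage more than a handful of nodes, generating the block from a single inventory list avoids copy-paste drift between machines. A sketch (the array mirrors the node inventory table above):

```shell
#!/usr/bin/env bash
# Generate the cluster's /etc/hosts entries from one inventory list.
inventory=(
  "10.0.1.50 lb01"
  "10.0.1.51 cp01"
  "10.0.1.52 cp02"
  "10.0.1.53 cp03"
  "10.0.1.54 worker01"
  "10.0.1.55 worker02"
  "10.0.1.60 k8s-api"
)
printf '%s\n' "${inventory[@]}"

# On each node, append the block (duplicate checking left to the reader):
#   printf '%s\n' "${inventory[@]}" | sudo tee -a /etc/hosts
```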

Disable Swap

Kubelet refuses to start if swap is active. Disable it immediately and remove the swap entry from fstab so it stays off after reboot:

sudo swapoff -a
sudo sed -i '/\sswap\s/s/^/#/' /etc/fstab

Confirm swap is off:

free -h | grep Swap

The Swap line should show all zeros:

Swap:            0B          0B          0B

Load Kernel Modules

Kubernetes networking requires the overlay and br_netfilter modules. Load them now and make them persistent:

cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

Set Sysctl Parameters

Bridge traffic needs to pass through iptables for Kubernetes network policies to work. IP forwarding is required for pod-to-pod communication across nodes:

cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

sudo sysctl --system

Verify the values are applied:

sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward

All three should return = 1:

net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1

Install containerd

We install containerd from Docker’s official repository because the Ubuntu-packaged version often lags behind. Start by adding Docker’s GPG key and repo:

sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg

sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

sudo apt-get update
sudo apt-get install -y containerd.io

Check the installed version:

containerd --version

The output should confirm a containerd 2.x release:

containerd github.com/containerd/containerd/v2 v2.2.2 2a13d8c2b3c7cd9a9facda40a679f64bd4150236

Configure containerd for systemd Cgroup

Kubernetes expects the container runtime to use the systemd cgroup driver. Generate the default containerd config, then modify it:

sudo mkdir -p /etc/containerd
containerd config default | sudo tee /etc/containerd/config.toml > /dev/null

Edit the config file:

sudo vi /etc/containerd/config.toml

Find the [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options] section and set SystemdCgroup to true:

SystemdCgroup = true

Alternatively, use sed to make the change in one shot:

sudo sed -i 's/SystemdCgroup = false/SystemdCgroup = true/' /etc/containerd/config.toml

Restart containerd to pick up the change:

sudo systemctl restart containerd
sudo systemctl enable containerd

Install kubeadm, kubelet, and kubectl

Add the Kubernetes v1.35 package repository. The new pkgs.k8s.io repository uses version-specific paths, so you get exactly the minor version you want:

sudo apt-get install -y apt-transport-https

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.35/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

echo "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.35/deb/ /" | sudo tee /etc/apt/sources.list.d/kubernetes.list

sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl

The apt-mark hold prevents these packages from being accidentally upgraded, which could break the cluster. Verify the installed versions:

kubeadm version -o short

Output confirms v1.35.3:

v1.35.3

Configure UFW Firewall

On control plane nodes (cp01, cp02, cp03), open these ports:

sudo ufw allow 6443/tcp
sudo ufw allow 2379:2380/tcp
sudo ufw allow 10250/tcp
sudo ufw allow 10259/tcp
sudo ufw allow 10257/tcp
sudo ufw allow 179/tcp
sudo ufw allow 4789/udp
sudo ufw allow 5473/tcp
sudo ufw reload

On worker nodes (worker01, worker02), you need fewer ports:

sudo ufw allow 10250/tcp
sudo ufw allow 30000:32767/tcp
sudo ufw allow 179/tcp
sudo ufw allow 4789/udp
sudo ufw allow 5473/tcp
sudo ufw reload

On the load balancer (lb01), only port 6443 needs to be open:

sudo ufw allow 6443/tcp
sudo ufw reload

Set Up the Load Balancer

Run these steps on lb01 only. HAProxy distributes API server traffic across the three control plane nodes, and keepalived manages the VIP so the load balancer itself isn’t a single point of failure. (In a full production setup, you’d run keepalived on two LB nodes. For this guide, one LB with a VIP is sufficient to demonstrate the pattern.)

Install HAProxy and Keepalived

sudo apt-get update
sudo apt-get install -y haproxy keepalived

Configure HAProxy

Open the HAProxy configuration file:

sudo vi /etc/haproxy/haproxy.cfg

Replace the contents with this configuration, which load balances TCP traffic on port 6443 across the three control plane nodes:

global
    log /dev/log local0
    log /dev/log local1 notice
    daemon

defaults
    log     global
    mode    tcp
    option  tcplog
    option  dontlognull
    timeout connect 5000ms
    timeout client  50000ms
    timeout server  50000ms

frontend k8s-api
    bind *:6443
    default_backend k8s-api-backend

backend k8s-api-backend
    option tcp-check
    balance roundrobin
    server cp01 10.0.1.51:6443 check fall 3 rise 2
    server cp02 10.0.1.52:6443 check fall 3 rise 2
    server cp03 10.0.1.53:6443 check fall 3 rise 2

Validate the configuration syntax:

sudo haproxy -c -f /etc/haproxy/haproxy.cfg

You should see:

Configuration file is valid

Configure Keepalived

Open the keepalived configuration:

sudo vi /etc/keepalived/keepalived.conf

Add this configuration. Replace eth0 with your actual network interface name (check with ip a):

vrrp_script check_haproxy {
    script "/usr/bin/killall -0 haproxy"
    interval 2
    weight 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass k8s-vip-secret
    }
    virtual_ipaddress {
        10.0.1.60/24
    }
    track_script {
        check_haproxy
    }
}

Start and enable both services:

sudo systemctl enable --now haproxy
sudo systemctl enable --now keepalived

Verify the VIP is active on the load balancer:

ip addr show eth0 | grep 10.0.1.60

The output confirms the VIP is assigned:

    inet 10.0.1.60/24 scope global secondary eth0

At this point, HAProxy will report all backends as DOWN because no API server is running yet. That’s expected.

Bootstrap the First Control Plane Node

Run these commands on cp01 only. This is the initial control plane node that seeds the cluster, uploads certificates, and generates the join tokens for the other nodes.

sudo kubeadm init \
  --control-plane-endpoint "k8s-api:6443" \
  --upload-certs \
  --pod-network-cidr=10.244.0.0/16

The --control-plane-endpoint flag tells every node to contact the API server through the VIP hostname. The --upload-certs flag encrypts and uploads control plane certificates to a kubeadm-certs Secret so that joining control plane nodes can pull them automatically. Without it, you’d have to manually copy certificates between nodes.
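If you prefer declarative, version-controlled setup, the same init can be expressed as a kubeadm config file. A sketch, assuming the v1beta4 kubeadm API; verify the exact apiVersion against `kubeadm config print init-defaults` for your release:

```shell
# Write the config, then run:
#   sudo kubeadm init --config kubeadm-config.yaml --upload-certs
cat > kubeadm-config.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
kubernetesVersion: v1.35.3
controlPlaneEndpoint: "k8s-api:6443"
networking:
  podSubnet: "10.244.0.0/16"
EOF
```

Note that `--upload-certs` stays on the command line; the config file replaces only the endpoint and CIDR flags.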

When the init completes, kubeadm prints two join commands: one for additional control plane nodes and one for workers. Save both of these. The certificate key expires after 2 hours. The output looks like this:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You can now join any number of control-plane nodes by running the following command on each:

  kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:abcdef0123456789... \
    --control-plane --certificate-key abcdef0123456789...

Then you can join any number of worker nodes by running the following on each as root:

  kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:abcdef0123456789...

Set up kubectl access on cp01. (For a kubectl reference, see our kubectl and kubectx cheat sheet.)

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Quick check that the API server responds through the VIP:

kubectl cluster-info

The output should reference the VIP endpoint:

Kubernetes control plane is running at https://k8s-api:6443
CoreDNS is running at https://k8s-api:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

The node will show as NotReady until a CNI plugin is installed. That’s the next step.

Install Calico CNI

Calico provides both pod networking and network policy enforcement. We’re using the Tigera operator method, which manages Calico as a set of custom resources rather than a raw manifest. This makes upgrades cleaner and configuration more declarative.

Apply the Tigera operator:

kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.29.3/manifests/tigera-operator.yaml

Create the Installation custom resource. The CIDR must match the --pod-network-cidr you passed to kubeadm init:

cat <<EOF | kubectl apply -f -
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16
      encapsulation: VXLANCrossSubnet
      natOutgoing: Enabled
      nodeSelector: all()
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}
EOF

Watch the Calico pods come up:

watch kubectl get pods -n calico-system

Wait until all pods show Running. This typically takes 2 to 3 minutes:

NAME                                       READY   STATUS    RESTARTS   AGE
calico-kube-controllers-6b4b6f4d5c-x9kzn   1/1     Running   0          2m15s
calico-node-bq7kl                          1/1     Running   0          2m15s
calico-typha-7f4d6f5b5c-jk2md              1/1     Running   0          2m15s
csi-node-driver-lznp8                      2/2     Running   0          2m15s

Configure Failsafe Inbound Host Ports

This step is critical and catches most people off guard. Calico’s default rpfilter (reverse path filter) settings can block inter-node etcd traffic, which causes etcd health checks to fail when you try to join additional control plane nodes. Configure failsafe ports before joining cp02 and cp03:

kubectl patch felixconfiguration default --type=merge -p '{"spec":{"failsafeInboundHostPorts":[{"protocol":"tcp","port":22},{"protocol":"tcp","port":6443},{"protocol":"tcp","port":2379},{"protocol":"tcp","port":2380},{"protocol":"tcp","port":10250},{"protocol":"tcp","port":10259},{"protocol":"tcp","port":10257},{"protocol":"tcp","port":179},{"protocol":"udp","port":4789},{"protocol":"tcp","port":5473}]}}'

Verify the configuration was applied:

kubectl get felixconfiguration default -o jsonpath='{.spec.failsafeInboundHostPorts}' | python3 -m json.tool

You should see the full list of failsafe ports in the output. If you skip this step, the cp02 and cp03 join will hang during the etcd health check. The alternative workaround is to pass --skip-phases=check-etcd during join, but configuring the failsafe ports properly is the correct fix.
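The one-line patch is hard to review. The same change can be kept as a patch file in version control and applied with `kubectl patch --patch-file` (a sketch; the port list is identical to the inline command above):

```shell
# Write the Felix failsafe ports as a reviewable patch file.
cat > felix-failsafe.yaml <<'EOF'
spec:
  failsafeInboundHostPorts:
    - {protocol: tcp, port: 22}
    - {protocol: tcp, port: 6443}
    - {protocol: tcp, port: 2379}
    - {protocol: tcp, port: 2380}
    - {protocol: tcp, port: 10250}
    - {protocol: tcp, port: 10259}
    - {protocol: tcp, port: 10257}
    - {protocol: tcp, port: 179}
    - {protocol: udp, port: 4789}
    - {protocol: tcp, port: 5473}
EOF

# Apply it:
#   kubectl patch felixconfiguration default --type=merge --patch-file felix-failsafe.yaml
```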

Cilium as an alternative: If you prefer eBPF-based networking, Cilium is a solid choice that avoids the rpfilter issue entirely. It’s more complex to configure for HA clusters but provides better observability through Hubble. For this guide, we stick with Calico because it’s the most battle-tested CNI for kubeadm HA setups.

Join Additional Control Plane Nodes

Run the control plane join command (from the kubeadm init output) on cp02 and cp03. The command includes the --control-plane flag and the certificate key:

sudo kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:abcdef0123456789... \
  --control-plane --certificate-key abcdef0123456789...

Replace the token, hash, and certificate key with your actual values from the init output. If the certificate key has expired (older than 2 hours), regenerate it on cp01:

sudo kubeadm init phase upload-certs --upload-certs

If the join hangs during the etcd health check despite configuring failsafe ports, you can bypass it with:

sudo kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:abcdef0123456789... \
  --control-plane --certificate-key abcdef0123456789... \
  --skip-phases=check-etcd

Use this only as a last resort. The failsafe port configuration from the previous section should resolve it. After both nodes join, set up kubectl access on each one:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

Back on cp01, verify all three control plane nodes are Ready:

kubectl get nodes

The output should show three control plane nodes:

NAME     STATUS   ROLES           AGE     VERSION
cp01     Ready    control-plane   12m     v1.35.3
cp02     Ready    control-plane   4m31s   v1.35.3
cp03     Ready    control-plane   2m18s   v1.35.3

Join Worker Nodes

On worker01 and worker02, run the worker join command from the kubeadm init output. This is the simpler join without the --control-plane flag:

sudo kubeadm join k8s-api:6443 --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:abcdef0123456789...

Workers join quickly since they don’t need to sync etcd or download control plane certificates. Back on cp01, confirm all five nodes are part of the cluster:

kubectl get nodes -o wide

All five nodes should show Ready:

NAME       STATUS   ROLES           AGE     VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION          CONTAINER-RUNTIME
cp01       Ready    control-plane   18m     v1.35.3    10.0.1.51     <none>        Ubuntu 24.04.4 LTS   6.8.0-101-generic       containerd://2.2.2
cp02       Ready    control-plane   10m     v1.35.3    10.0.1.52     <none>        Ubuntu 24.04.4 LTS   6.8.0-101-generic       containerd://2.2.2
cp03       Ready    control-plane   8m      v1.35.3    10.0.1.53     <none>        Ubuntu 24.04.4 LTS   6.8.0-101-generic       containerd://2.2.2
worker01   Ready    <none>          3m12s   v1.35.3    10.0.1.54     <none>        Ubuntu 24.04.4 LTS   6.8.0-101-generic       containerd://2.2.2
worker02   Ready    <none>          2m45s   v1.35.3    10.0.1.55     <none>        Ubuntu 24.04.4 LTS   6.8.0-101-generic       containerd://2.2.2

Verify the HA Cluster

A cluster with all nodes Ready doesn’t prove HA is working. You need to verify etcd quorum, test failover, and confirm workloads schedule correctly across workers.

Check etcd Cluster Health

List the etcd members from cp01:

sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member list -w table

All three members should show as started:

+------------------+---------+------+-------------------------+-------------------------+------------+
|        ID        | STATUS  | NAME | PEER ADDRS              | CLIENT ADDRS            | IS LEARNER |
+------------------+---------+------+-------------------------+-------------------------+------------+
| 2a1b3c4d5e6f7a8b | started | cp01 | https://10.0.1.51:2380  | https://10.0.1.51:2379  |      false |
| 3b2c4d5e6f7a8b9c | started | cp02 | https://10.0.1.52:2380  | https://10.0.1.52:2379  |      false |
| 4c3d5e6f7a8b9c0d | started | cp03 | https://10.0.1.53:2380  | https://10.0.1.53:2379  |      false |
+------------------+---------+------+-------------------------+-------------------------+------------+

Check endpoint health across all three:

sudo ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.1.51:2379,https://10.0.1.52:2379,https://10.0.1.53:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  endpoint health -w table

All endpoints should report as healthy:

+-------------------------+--------+-------------+-------+
|        ENDPOINT         | HEALTH |    TOOK     | ERROR |
+-------------------------+--------+-------------+-------+
| https://10.0.1.51:2379  |   true |  12.456ms   |       |
| https://10.0.1.52:2379  |   true |  14.231ms   |       |
| https://10.0.1.53:2379  |   true |  13.892ms   |       |
+-------------------------+--------+-------------+-------+

Test a Workload Deployment

Deploy a simple nginx workload with 3 replicas to confirm scheduling works across the worker nodes:

kubectl create deployment nginx-test --image=nginx:latest --replicas=3

Check where the pods landed:

kubectl get pods -o wide -l app=nginx-test

Pods should be distributed across the worker nodes:

NAME                          READY   STATUS    RESTARTS   AGE   IP             NODE       NOMINATED NODE   READINESS GATES
nginx-test-7c5b8d6f4-2kxnm   1/1     Running   0          28s   10.244.1.12    worker01   <none>           <none>
nginx-test-7c5b8d6f4-8mz4p   1/1     Running   0          28s   10.244.2.8     worker02   <none>           <none>
nginx-test-7c5b8d6f4-j3f7s   1/1     Running   0          28s   10.244.1.13    worker01   <none>           <none>

Clean up the test deployment:

kubectl delete deployment nginx-test

Rocky Linux 10 vs Ubuntu 24.04 Comparison

If you’re running Rocky Linux 10 (or AlmaLinux 10, RHEL 10) instead of Ubuntu, several commands and paths differ. This table covers the key differences so you can adapt the guide to your distribution.

| Item               | Rocky Linux 10 / RHEL 10                                          | Ubuntu 24.04                                   |
|--------------------|-------------------------------------------------------------------|------------------------------------------------|
| Package manager    | dnf install                                                        | apt-get install                                |
| containerd install | Docker CE repo for RHEL: dnf config-manager --add-repo             | Docker CE repo for Ubuntu: apt sources         |
| K8s repo format    | pkgs.k8s.io/core:/stable:/v1.35/rpm/                               | pkgs.k8s.io/core:/stable:/v1.35/deb/           |
| Hold packages      | dnf versionlock add kubelet kubeadm kubectl                        | apt-mark hold kubelet kubeadm kubectl          |
| Firewall tool      | firewall-cmd --permanent --add-port=6443/tcp                       | ufw allow 6443/tcp                             |
| SELinux / AppArmor | SELinux enforcing; requires setsebool -P container_manage_cgroup on | AppArmor active; no extra config needed for K8s |
| Service names      | Same: kubelet, containerd, haproxy                                 | Same: kubelet, containerd, haproxy             |
| Config paths       | /etc/containerd/config.toml (same)                                 | /etc/containerd/config.toml (same)             |
| Kernel modules     | Same approach: /etc/modules-load.d/                                | Same approach: /etc/modules-load.d/            |

On Rocky Linux 10 with SELinux enforcing, you also need to allow the kubelet to manage container cgroups. Without this, pod creation fails with permission denied errors.

Production Hardening

A working HA cluster is just the foundation. Before running real workloads, address these areas.

Certificate Rotation

Kubeadm certificates expire after one year by default. The kubelet handles its own certificate rotation automatically, but the control plane certificates (API server, etcd, front-proxy) need manual renewal. Check expiration dates:

sudo kubeadm certs check-expiration

Renew all certificates before they expire:

sudo kubeadm certs renew all

After renewal, restart the control plane static pods by moving the manifests out and back:

sudo mv /etc/kubernetes/manifests/*.yaml /tmp/
sleep 10
sudo mv /tmp/kube-*.yaml /tmp/etcd.yaml /etc/kubernetes/manifests/

Run the renewal on each control plane node. Set a calendar reminder for 11 months out.
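Better than a calendar reminder is an automated check. A sketch using openssl against the API server certificate (the path is the standard kubeadm location; wiring it into cron or a systemd timer and how you alert are up to you):

```shell
#!/usr/bin/env bash
# Exit 0 if the certificate at $1 is still valid for at least $2 days (default 30).
cert_ok_for_days() {
  local cert="$1" days="${2:-30}"
  # -checkend takes seconds and fails if the cert expires within that window.
  openssl x509 -checkend $(( days * 86400 )) -noout -in "$cert" >/dev/null
}

# Example cron usage on a control plane node:
#   cert_ok_for_days /etc/kubernetes/pki/apiserver.crt 30 \
#     || echo "apiserver cert expires within 30 days - run kubeadm certs renew all"
```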

etcd Backups

etcd holds the entire cluster state. If all three etcd members are lost without a backup, the cluster is gone. Schedule automated snapshots on one control plane node:

sudo ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-snapshot-$(date +%Y%m%d).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key

Verify the snapshot is valid:

sudo ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-snapshot-$(date +%Y%m%d).db -w table

Copy snapshots to off-cluster storage. An etcd backup on the same machine it’s backing up is not really a backup.
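A cron-ready wrapper that snapshots and prunes old copies might look like this sketch. The backup directory and 14-day retention are assumptions to adapt; the etcdctl flags are the same as in the manual command above:

```shell
#!/usr/bin/env bash
# Daily etcd snapshot with simple retention. Run from root's crontab, e.g.:
#   0 2 * * * /usr/local/sbin/etcd-backup.sh
set -euo pipefail

BACKUP_DIR="${BACKUP_DIR:-/var/backups/etcd}"
RETENTION_DAYS="${RETENTION_DAYS:-14}"

snapshot() {
  mkdir -p "$BACKUP_DIR"
  ETCDCTL_API=3 etcdctl snapshot save "$BACKUP_DIR/etcd-snapshot-$(date +%Y%m%d).db" \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
    --key=/etc/kubernetes/pki/etcd/healthcheck-client.key
}

prune() {
  # Delete snapshots older than the retention window.
  find "$BACKUP_DIR" -name 'etcd-snapshot-*.db' -mtime +"$RETENTION_DAYS" -delete
}

# On a control plane node: snapshot && prune
```

Follow the script with an off-node copy (rsync, object storage) so the backup survives the host it was taken on.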

Network Policies

Calico supports Kubernetes NetworkPolicy out of the box. By default, all pod-to-pod traffic is allowed. At minimum, create a default-deny ingress policy for each namespace, then explicitly allow the traffic patterns your applications need. This prevents lateral movement if a pod is compromised. Once your cluster is stable, set up Prometheus and Grafana to monitor cluster health and resource usage.
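A default-deny ingress policy is a one-screen manifest. A sketch for a hypothetical production namespace (adjust the namespace to your own):

```shell
# Deny all ingress to pods in the namespace unless another policy allows it.
cat > default-deny-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress
EOF

# kubectl apply -f default-deny-ingress.yaml
```

With this in place, each application needs an explicit allow policy for the traffic it expects, which is exactly the inventory exercise that surfaces forgotten dependencies.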

Pod Security Standards

Kubernetes 1.35 enforces Pod Security Standards via the built-in admission controller. Label your namespaces to apply the appropriate level:

kubectl label namespace production pod-security.kubernetes.io/enforce=restricted
kubectl label namespace production pod-security.kubernetes.io/warn=restricted

The restricted level prevents running as root, disallows privilege escalation, and requires dropping all Linux capabilities. Start with baseline if restricted breaks your existing workloads, then work toward restricted over time.

Troubleshooting

Calico rpfilter Blocks Inter-Node Traffic

After installing Calico, you might notice that nodes can’t communicate on etcd or kubelet ports despite UFW rules being correct. Calico’s Felix component applies strict reverse path filtering by default. The fix is the FelixConfiguration patch shown earlier in this guide, which adds the Kubernetes ports to the failsafe list. Apply it before joining additional control plane nodes. If you’ve already joined them and they’re stuck, apply the patch and then restart kubelet on the affected nodes.

HAProxy Shows All Backends DOWN During Init

When you first start HAProxy, all three backend servers show as DOWN because no API server exists yet. This is normal. After the first control plane node bootstraps and the API server starts listening on port 6443, HAProxy detects it within a few health check intervals. If a backend stays DOWN after init completes, check that the control plane node’s firewall allows TCP 6443 from the load balancer IP.

etcd Health Check Fails During Control Plane Join

When joining cp02 or cp03, you might see the join process hang at “Running pre-flight checks” with etcd health check timeouts. This happens when Calico’s network policy blocks the joining node from reaching the existing etcd endpoint on port 2379. Two options: apply the failsafe inbound host ports patch (preferred), or pass --skip-phases=check-etcd to the join command as a workaround. The check-etcd phase is a pre-flight safety check, not a functional requirement. Skipping it lets the join proceed, and etcd will still form a healthy cluster.
