Managing nodes is where Kubernetes operations get real. Anyone can spin up a cluster, but keeping it healthy over weeks and months requires knowing how to inspect nodes, schedule maintenance windows, and scale capacity without dropping traffic. Rancher gives you a unified view across clusters, though the real power still lives in kubectl.
This guide walks through the full node lifecycle on a 3-node RKE2 HA cluster managed by Rancher v2.14.0: viewing node status, labeling and tainting, cordoning for maintenance, adding workers, and safely removing nodes. Every command was tested on a live cluster with real workloads.
Verified working: March 2026 on RKE2 v1.32.4+rke2r1 HA cluster, Rancher v2.14.0, Rocky Linux 10.1 (kernel 6.12), SELinux enforcing
View Node Status
The first thing you check when something feels off in a cluster is node health. kubectl get nodes with the -o wide flag gives you the essentials at a glance: status, roles, Kubernetes version, internal IP, OS image, and container runtime.
kubectl get nodes -o wide
On our 3-node HA cluster, all nodes show Ready with control-plane and etcd roles:
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION
rke2-ha-1 Ready control-plane,etcd,master 12d v1.32.4+rke2r1 10.0.1.11 <none> Rocky Linux 10.1 (Blue Onyx) 6.12.0-55.el10.x86_64
rke2-ha-2 Ready control-plane,etcd,master 12d v1.32.4+rke2r1 10.0.1.12 <none> Rocky Linux 10.1 (Blue Onyx) 6.12.0-55.el10.x86_64
rke2-ha-3 Ready control-plane,etcd,master 12d v1.32.4+rke2r1 10.0.1.13 <none> Rocky Linux 10.1 (Blue Onyx) 6.12.0-55.el10.x86_64
For deeper inspection, kubectl describe node dumps everything: conditions, capacity, allocatable resources, running pods, and events. This is what you reach for when a node goes NotReady and you need to figure out why.
kubectl describe node rke2-ha-1
The Conditions section is the most useful part. Look for MemoryPressure, DiskPressure, and PIDPressure because any of those set to True means the node is struggling. Under normal operation, all three should be False.
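To pull just the conditions without the full describe dump, a standard kubectl JSONPath query works; a sketch against the first node:

```shell
# Print each condition as type=status, one per line.
# Healthy output: MemoryPressure=False, DiskPressure=False,
# PIDPressure=False, Ready=True.
kubectl get node rke2-ha-1 \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```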
To get a quick resource utilization snapshot across the cluster, use kubectl top nodes. This requires the metrics-server to be running (RKE2 deploys it by default).
kubectl top nodes
Here is what our 3-node cluster reports:
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
rke2-ha-1 188m 9% 2535Mi 69%
rke2-ha-2 189m 9% 2431Mi 66%
rke2-ha-3 200m 10% 2417Mi 66%
CPU is comfortable at around 10%, but memory sits near 70% across all nodes. That is typical for RKE2 clusters with Rancher installed because the management components are memory-hungry. If you see memory climbing above 85%, it is time to add a worker node or increase RAM.
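A simple way to turn that threshold into a check is to parse the kubectl top output with awk. The snippet below inlines the sample rows from the cluster above so it is self-contained; in practice you would pipe in `kubectl top nodes --no-headers` instead of printf, and the 85% limit is the suggested warning threshold, not a kubectl default:

```shell
# Flag any node whose MEMORY% (field 5) exceeds the limit.
printf '%s\n' \
  'rke2-ha-1 188m 9% 2535Mi 69%' \
  'rke2-ha-2 189m 9% 2431Mi 66%' \
  'rke2-ha-3 200m 10% 2417Mi 66%' |
awk -v limit=85 '{
  gsub(/%/, "", $5)                       # strip the % sign
  status = ($5 + 0 > limit) ? "WARN" : "ok"
  print $1, $5 "%", status
}'
```

With all three nodes in the 66-69% range, every line reports ok; a node over the limit would print WARN instead.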
Node Labels and Annotations
Labels are key-value pairs attached to nodes that control where pods get scheduled. Annotations carry metadata (etcd snapshot paths, flannel config, RKE2 config hashes) but don’t affect scheduling. Both are visible in kubectl describe node output.
Add a custom label to designate a node’s role in your topology:
kubectl label node rke2-ha-3 node-role=worker
Verify the label was applied:
kubectl get nodes --show-labels | grep rke2-ha-3
You should see node-role=worker in the labels list alongside the default RKE2 labels like node-role.kubernetes.io/control-plane=true and node-role.kubernetes.io/etcd=true.
Query nodes by label to filter specific groups:
kubectl get nodes -l node-role=worker
The real value of labels is in pod scheduling. Use nodeSelector in a pod spec to pin workloads to labeled nodes. This is how you keep heavy batch jobs off your control-plane nodes or route GPU workloads to specific hardware.
apiVersion: v1
kind: Pod
metadata:
  name: worker-task
spec:
  nodeSelector:
    node-role: worker
  containers:
  - name: app
    image: nginx:latest
The scheduler will only place this pod on nodes carrying the node-role=worker label. If no node matches, the pod stays in Pending state until one becomes available.
To remove a label when it is no longer needed:
kubectl label node rke2-ha-3 node-role-
The trailing minus sign removes the key.
Taints and Tolerations
While labels pull pods toward specific nodes, taints push pods away. A tainted node rejects all pods unless they carry a matching toleration. This is the mechanism behind keeping user workloads off control-plane nodes in production clusters.
Taint a node to prevent general scheduling:
kubectl taint nodes rke2-ha-3 dedicated=monitoring:NoSchedule
Now only pods that explicitly tolerate this taint will land on rke2-ha-3. The NoSchedule effect blocks new pods but leaves existing ones running. Use NoExecute instead if you want to evict pods that are already there.
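With NoExecute, a toleration can also bound how long an already-running pod survives after the taint appears, via tolerationSeconds. A minimal fragment, assuming the same dedicated=monitoring taint applied with the NoExecute effect (the 300-second value is illustrative):

```yaml
tolerations:
- key: "dedicated"
  operator: "Equal"
  value: "monitoring"
  effect: "NoExecute"
  tolerationSeconds: 300   # pod is evicted 300s after the taint is applied
```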
A pod spec with the matching toleration looks like this:
apiVersion: v1
kind: Pod
metadata:
  name: monitoring-agent
spec:
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "monitoring"
    effect: "NoSchedule"
  containers:
  - name: prometheus-agent
    image: prom/node-exporter:latest
This pod will schedule on rke2-ha-3 even with the taint in place. Other pods without this toleration will be rejected by the scheduler.
Remove the taint when you no longer need the restriction:
kubectl taint nodes rke2-ha-3 dedicated=monitoring:NoSchedule-
One thing that catches people off guard: depending on how the cluster was provisioned, RKE2 server nodes may already carry taints, such as CriticalAddonsOnly when node-taint is set in the server config, or node-role.kubernetes.io taints on Rancher-provisioned clusters with dedicated control-plane roles. If you deploy workloads and they stay stuck in Pending, check whether the target node has taints you did not expect with kubectl describe node | grep Taints.
Cordon and Drain
Maintenance windows are a fact of life. Kernel updates, disk replacements, hardware upgrades. The safe sequence is: cordon the node, drain it, do the work, then uncordon.
Cordoning marks the node as unschedulable. Existing pods keep running, but no new pods will be placed there:
kubectl cordon rke2-ha-3
Check the status to confirm the cordon took effect:
kubectl get nodes
The output shows SchedulingDisabled appended to the status:
NAME STATUS ROLES AGE VERSION
rke2-ha-1 Ready control-plane,etcd,master 12d v1.32.4+rke2r1
rke2-ha-2 Ready control-plane,etcd,master 12d v1.32.4+rke2r1
rke2-ha-3 Ready,SchedulingDisabled control-plane,etcd,master 12d v1.32.4+rke2r1
Cordoning alone is not enough for maintenance. Pods are still running on the node. To safely evict them, drain the node:
kubectl drain rke2-ha-3 --ignore-daemonsets --delete-emptydir-data
The --ignore-daemonsets flag is almost always needed because DaemonSet pods (like flannel, kube-proxy, and monitoring agents) exist on every node by design and cannot be evicted. The --delete-emptydir-data flag allows eviction of pods using emptyDir volumes, acknowledging that their ephemeral data will be lost.
If a pod has a PodDisruptionBudget (PDB) that would be violated by the eviction, drain will block and wait. In production, this is the behavior you want because it prevents accidentally taking down a service that has only one replica left.
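For reference, a minimal PDB that would make drain wait looks like this; the name and the app: web selector are hypothetical:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 1        # drain blocks if eviction would drop below this
  selector:
    matchLabels:
      app: web
```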
After completing maintenance, bring the node back online:
kubectl uncordon rke2-ha-3
The node immediately returns to Ready status and the scheduler resumes placing pods on it.
Add a Worker Node
When cluster resources are running thin (remember those 70% memory numbers from earlier), adding a worker node is straightforward with RKE2. The new node runs the agent component and joins using the cluster’s registration token.
On the new node, install the RKE2 agent:
curl -sfL https://get.rke2.io | INSTALL_RKE2_TYPE="agent" sh -
Create the agent configuration file pointing to an existing server node:
sudo mkdir -p /etc/rancher/rke2
Open the config file:
sudo vi /etc/rancher/rke2/config.yaml
Add the server URL and join token:
server: https://10.0.1.11:9345
token: your-cluster-join-token
Retrieve the join token from any existing server node at /var/lib/rancher/rke2/server/node-token. Then start the agent:
sudo systemctl enable --now rke2-agent.service
Within a minute or two, the new node appears in kubectl get nodes with a Ready status. The RKE2 documentation covers advanced agent configuration options including custom kubelet arguments and private registry mirrors.
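New agent nodes join with no role label, so the ROLES column shows `<none>`. If you want them to display worker the way control-plane nodes display their roles, add the conventional label yourself (the node name here is hypothetical):

```shell
# The ROLES column is derived from node-role.kubernetes.io/* label keys.
kubectl label node rke2-worker-1 node-role.kubernetes.io/worker=true
```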
Remove a Node
Removing a node cleanly requires three steps: drain the workloads, delete the node object from Kubernetes, and clean up the node itself.
Drain the node first to relocate all pods:
kubectl drain rke2-ha-3 --ignore-daemonsets --delete-emptydir-data --force
The --force flag is necessary here because some pods without a controller (bare pods not managed by a Deployment or StatefulSet) cannot be evicted gracefully. Since you are permanently removing the node, forcing eviction is acceptable.
Delete the node object from the cluster:
kubectl delete node rke2-ha-3
On the node being removed, stop and disable the RKE2 service, then run the uninstall script:
sudo systemctl stop rke2-agent.service
sudo systemctl disable rke2-agent.service
sudo /usr/local/bin/rke2-uninstall.sh
For control-plane nodes, stop and disable rke2-server.service instead of the agent service; the rke2-uninstall.sh script is the same for both roles. Be extremely careful removing control-plane nodes from an HA cluster. Losing etcd quorum (a majority of the etcd members) means the cluster cannot process writes. Never remove more than one control-plane node at a time from a 3-node cluster.
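Before touching a control-plane node, it is worth counting the etcd members that will remain; on RKE2 they carry the etcd role label, so a label selector lists them:

```shell
# On a 3-node HA cluster this should return all three nodes;
# removal is only safe while a majority stays Ready.
kubectl get nodes -l node-role.kubernetes.io/etcd=true
```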
Rancher UI Node Management
Everything discussed so far can also be done through the Rancher web interface. Navigate to Cluster Management and select your cluster to see the node list. Rancher shows each node’s state, roles, IP address, CPU and memory usage, pod count, and age.
On our cluster, the Rancher dashboard shows all three nodes in Active state with their internal IPs (10.0.1.11, 10.0.1.12, 10.0.1.13), resource utilization graphs, and role assignments. The UI provides quick actions for cordoning, draining, and editing labels and taints without touching kubectl.
Click on any node to see detailed metrics: CPU and memory usage over time, running pods with their resource requests and limits, node conditions, and system information. The Events tab is particularly useful for troubleshooting because it surfaces Kubernetes events that you would otherwise need to parse from kubectl describe node output.
Rancher also shows node annotations that carry operational metadata. On RKE2 clusters, you will see annotations for etcd snapshot timestamps, flannel VXLAN network configuration, and the RKE2 config hash (which tells you if a node’s configuration has drifted from the expected state). These are not visible in the basic kubectl get nodes output, making the Rancher UI a faster way to audit cluster consistency.
For multi-cluster environments, the Rancher UI becomes essential. Switching between clusters in kubectl means juggling kubeconfig contexts, while Rancher presents all clusters in a single pane with consistent health indicators.
Monitoring Recommendations
Node management does not end at manual inspection. In production, you need automated monitoring that alerts before problems become outages. Here is what to track:
- Node readiness – Alert if any node stays in NotReady for more than 2 minutes. This catches kubelet crashes, network partitions, and failed health checks
- Memory pressure – Our cluster already sits at 66-69% memory usage. Set a warning threshold at 80% and critical at 90%. By the time the kernel OOM-killer fires, it is too late
- Disk pressure – Monitor both the root filesystem and /var/lib/rancher, where container images and etcd data live. RKE2 etcd snapshots can grow large if retention is not configured
- Pod eviction events – Track eviction events to catch capacity issues before they cascade. Frequent evictions mean the cluster is undersized
- Certificate expiry – RKE2 manages its own certificates, but monitoring their expiry dates prevents surprise outages. Check with kubectl get csr and inspect the serving certs
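The node-readiness alert above can be expressed as a PrometheusRule using the kube_node_status_condition metric exposed by kube-state-metrics. A sketch, with the rule name and namespace being illustrative (the namespace matches where Rancher's monitoring stack typically runs):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-health
  namespace: cattle-monitoring-system
spec:
  groups:
  - name: nodes
    rules:
    - alert: NodeNotReady
      # 1 means the condition holds; Ready=true == 0 means the node is not Ready
      expr: kube_node_status_condition{condition="Ready",status="true"} == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: 'Node {{ $labels.node }} has been NotReady for over 2 minutes'
```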
Rancher includes a built-in monitoring stack based on Prometheus and Grafana. Enable it from Cluster Tools in the Rancher UI. It deploys node-exporter on every node, Prometheus for metrics collection, and pre-built Grafana dashboards for cluster and node-level visibility. For a complete reference on inspecting and managing Kubernetes resources from the command line, see our kubectl cheat sheet.