Ansible with Kubernetes: Deploy and Manage

Ansible and Kubernetes meet at two points, and they pull in different directions.

The first is provisioning: turning a pile of fresh Ubuntu machines into a working cluster. kubeadm assembles the cluster, but something has to disable swap, load kernel modules, install containerd, lay down the package repo, and run kubeadm in the right order on the right hosts. That something is Ansible. The second point is day-to-day management. Once the cluster runs, you create namespaces, push Deployments, install Helm charts, and drain nodes for patching. The kubernetes.core collection does all of that declaratively, from the same control node, in the same playbook language.

This guide covers both. We provision a kubeadm cluster with a set of Ansible roles, then manage real workloads on it with kubernetes.core: a Deployment, a Helm release, a node drain, and a worker added live. If Ansible itself is new on your control node, set it up first with the install Ansible guide; this article is part of the wider Ansible automation guide.

Everything below was run in June 2026 on Ubuntu 24.04 with Kubernetes 1.36 and the kubernetes.core 6.4 collection.

How Ansible and Kubernetes fit together

Keep the two jobs separate in your head, because they use different tools.

Provisioning runs against the nodes over SSH. Ansible becomes root, installs packages, and shells out to kubeadm. This is ordinary server automation that happens to end in a cluster. Management runs against the Kubernetes API, not the nodes. The kubernetes.core modules talk to the API server with the Python Kubernetes client, so they run on the control node itself and need a kubeconfig, not SSH. One repo holds both: roles for the first job, playbooks in a manage/ directory for the second.

Lab layout

Four machines, all Ubuntu 24.04:

Ansible controller, where you run the playbooks. It never joins the cluster.
One control-plane node (the kubeadm “first” node).
Two worker nodes to start. We add a third later without touching the first two.

The controller reaches every node as a sudo-capable user over an SSH key, which is the only prerequisite the roles assume. Give each node 2 vCPU and at least 2 GB of RAM; the control plane is happier with 4 GB. On the control-plane node, the kubeadm init preflight check fails outright with fewer than 2 CPUs.

Set up the Ansible controller

The controller needs Ansible, the Python Kubernetes client in the same environment Ansible runs from, the kubernetes.core collection, and Helm. Install pipx, then layer the pieces on top:

sudo apt update
sudo apt install -y pipx python3-venv
pipx install --include-deps ansible
pipx inject ansible kubernetes
ansible-galaxy collection install kubernetes.core
curl -fsSL https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash

The pipx inject step is the one people miss. The kubernetes.core modules import the kubernetes Python library at runtime, and they look for it in Ansible’s own virtualenv. Installing it with a separate pip puts it somewhere Ansible cannot see, and every task fails with “Failed to import the required Python library (kubernetes)”. Inject it into the Ansible venv and the problem disappears.

Confirm the collection is present:

ansible-galaxy collection list | grep kubernetes

The collection and its version print on a single line:

kubernetes.core                          6.4.0

That confirms the collection is installed, but it says nothing about the Python client. Check that separately with pipx runpip ansible show kubernetes, which runs pip inside the same virtualenv the modules import from. The management playbooks later depend on both.
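
The point about Ansible’s own virtualenv can also be checked from a playbook. The sketch below is a throwaway play (the file name check-client.yml is hypothetical) that imports the library with ansible_playbook_python, the interpreter Ansible itself runs under and the one implicit localhost tasks use by default. It assumes localhost is not redefined in your inventory.

```yaml
---
# check-client.yml (hypothetical name): run with `ansible-playbook check-client.yml`
- name: Check that the Kubernetes Python client is visible to Ansible
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Import the kubernetes library with Ansible's own interpreter
      ansible.builtin.command:
        argv:
          - "{{ ansible_playbook_python }}"
          - "-c"
          - "import kubernetes; print(kubernetes.__version__)"
      register: k8s_client
      changed_when: false

    - name: Show the client version that was found
      ansible.builtin.debug:
        var: k8s_client.stdout
```

If the import fails, the first task exits non-zero and the play stops, which is the same condition the kubernetes.core modules would hit later.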

Build the inventory

Group the nodes into a control plane and workers. The roles key off these group names, so the names matter.

[control_plane]
k8s-cp1 ansible_host=192.168.1.168

[workers]
k8s-w1 ansible_host=192.168.1.169
k8s-w2 ansible_host=192.168.1.170

[k8s_cluster:children]
control_plane
workers
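
With the groups defined, a short pre-flight play can confirm the lab sizing before any role runs. This is a sketch, not part of the roles described here: the file name is hypothetical, and the 1700 MB threshold is an assumption chosen because a VM sized at 2 GB reports somewhat less than 2048 MB of usable memory.

```yaml
---
# preflight.yml (hypothetical name): sanity checks before the bootstrap.
- name: Pre-flight checks for every cluster node
  hosts: k8s_cluster
  become: true
  gather_facts: true
  tasks:
    - name: Confirm SSH and privilege escalation work
      ansible.builtin.ping:

    - name: Check CPU and memory against the lab sizing
      ansible.builtin.assert:
        that:
          - ansible_facts['processor_vcpus'] | int >= 2
          # Assumed threshold: "about 2 GB" as the kernel reports it.
          - ansible_facts['memtotal_mb'] | int >= 1700
        fail_msg: >-
          {{ inventory_hostname }} has {{ ansible_facts['processor_vcpus'] }} vCPU
          and {{ ansible_facts['memtotal_mb'] }} MB RAM; the lab expects 2 vCPU
          and about 2 GB.
```

A failed assert names the undersized host, which is quicker to act on than a kubeadm preflight error halfway through the bootstrap.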

A handful of cluster-wide settings live in group_vars/all.yml. This is also where the one networking decision that bites people gets made.

---
# Kubernetes minor version. This is the pkgs.k8s.io repo path; bump it to upgrade.
k8s_minor: "v1.36" # https://kubernetes.io/releases/

# Pod network CIDR handed to kubeadm and Calico.
# MUST NOT overlap your node/LAN subnet (the lab nodes are on 192.168.1.0/24).
pod_network_cidr: "10.244.0.0/16"

The pod network CIDR must not overlap the subnet your nodes sit on. Calico’s own default is 192.168.0.0/16, and plenty of home and office LANs live inside that range. If they overlap, pod traffic and node traffic fight over the same addresses and routing breaks in ways that are miserable to debug. The lab nodes here are on 192.168.1.0/24, so the pods get 10.244.0.0/16 instead. Pick any private range that your network does not already use.
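
The overlap can be caught before kubeadm ever runs. The play below is a minimal sketch, not part of the roles in this guide: it checks only that no node address from the inventory falls inside pod_network_cidr, not that the whole LAN subnet is disjoint from it. It assumes ansible_host is an IP address (a hostname makes the check fail) and that python3 is on the controller’s PATH.

```yaml
---
# Hypothetical pre-flight play; run it with the same inventory as the bootstrap.
- name: Check that no node address falls inside the pod CIDR
  hosts: k8s_cluster
  gather_facts: false
  tasks:
    - name: Test each node address against pod_network_cidr
      ansible.builtin.command:
        argv:
          - python3
          - "-c"
          - |
            import ipaddress, sys
            pod_net = ipaddress.ip_network(sys.argv[1])
            node_ip = ipaddress.ip_address(sys.argv[2])
            # Exit non-zero when the node address sits inside the pod range.
            sys.exit(1 if node_ip in pod_net else 0)
          - "{{ pod_network_cidr }}"
          - "{{ ansible_host }}"
      delegate_to: localhost
      changed_when: false
```

With the lab values the task passes for every node; set pod_network_cidr to 192.168.0.0/16 and it fails for all of them.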

Prepare every node

The common role runs on the whole cluster and does everything kubeadm expects to already be true: swap off, the bridge and overlay modules loaded, the networking sysctls set, containerd installed with the systemd cgroup driver, and the Kubernetes packages held at a fixed version.

---
# Prereqs that every node (control plane and workers) needs before kubeadm runs.

- name: Disable swap for the running session
  ansible.builtin.command: swapoff -a
  changed_when: false

- name: Disable swap permanently in fstab
  ansible.posix.mount:
    path: "{{ item }}"
    state: absent
  loop:
    - swap
    - none
  when: ansible_swaptotal_mb | int > 0

- name: Load kernel modules now
  community.general.modprobe:
    name: "{{ item }}"
    state: present
  loop:
    - overlay
    - br_netfilter

- name: Load kernel modules on boot
  ansible.builtin.copy:
    dest: /etc/modules-load.d/k8s.conf
    content: |
      overlay
      br_netfilter
    mode: "0644"

- name: Apply sysctl settings for Kubernetes networking
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    sysctl_file: /etc/sysctl.d/k8s.conf
    reload: true
  loop:
    - { key: net.bridge.bridge-nf-call-iptables, value: "1" }
    - { key: net.bridge.bridge-nf-call-ip6tables, value: "1" }
    - { key: net.ipv4.ip_forward, value: "1" }

- name: Install containerd and apt prerequisites
  ansible.builtin.apt:
    name:
      - containerd
      - apt-transport-https
      - ca-certificates
      - curl
      - gpg
    state: present
    update_cache: true

- name: Create containerd config directory
  ansible.builtin.file:
    path: /etc/containerd
    state: directory
    mode: "0755"

- name: Generate default containerd config
  ansible.builtin.shell: containerd config default > /etc/containerd/config.toml
  args:
    creates: /etc/containerd/config.toml

- name: Use the systemd cgroup driver in containerd
  ansible.builtin.lineinfile:
    path: /etc/containerd/config.toml
    regexp: '^(\s*)SystemdCgroup\s*='
    line: '            SystemdCgroup = true'
  notify: Restart containerd

- name: Add the Kubernetes apt signing key
  ansible.builtin.get_url:
    url: "https://pkgs.k8s.io/core:/stable:/{{ k8s_minor }}/deb/Release.key"
    dest: /etc/apt/keyrings/kubernetes-apt-keyring.asc
    mode: "0644"

- name: Add the Kubernetes apt repository
  ansible.builtin.apt_repository:
    repo: "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.asc] https://pkgs.k8s.io/core:/stable:/{{ k8s_minor }}/deb/ /"
    filename: kubernetes
    state: present

- name: Install kubelet, kubeadm and kubectl
  ansible.builtin.apt:
    name:
      - kubelet
      - kubeadm
      - kubectl
    state: present
    update_cache: true

- name: Hold the Kubernetes packages at their current version
  ansible.builtin.dpkg_selections:
    name: "{{ item }}"
    selection: hold
  loop:
    - kubelet
    - kubeadm
    - kubectl

- name: Enable and start kubelet
  ansible.builtin.systemd:
    name: kubelet
    enabled: true
    state: started

- name: Flush handlers so containerd restarts before kubeadm runs
  ansible.builtin.meta: flush_handlers

One edit earns its place above all the others. Containerd ships with SystemdCgroup set to false, and on a systemd host it has to be true, or the kubelet and containerd disagree about who owns the cgroups and pods never leave ContainerCreating. On containerd 2.x the setting lives under the CRI runc options in /etc/containerd/config.toml, which is why the role generates the default config first and then edits that one line. The package hold near the end (the dpkg_selections task, which does the same job as apt-mark hold) stops an unattended apt upgrade from moving the cluster to a new patch release behind your back. The pkgs.k8s.io repo path already pins the minor version.
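
The lineinfile task notifies a handler named Restart containerd, which the listing does not show. A minimal sketch of what that handler could look like, assuming the standard role layout; the companion repository may define it differently:

```yaml
---
# roles/common/handlers/main.yml (assumed path, following the usual role layout)
- name: Restart containerd
  ansible.builtin.systemd:
    name: containerd
    state: restarted
    enabled: true
```

The handler name has to match the notify string exactly; otherwise Ansible reports the handler as not found and the play fails.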

Bring up the control plane

The control_plane role initialises the cluster, installs a kubeconfig for your login user, lays down the Calico CNI, and generates a join command the workers pick up. It reads ansible_user to find that login user’s home directory, so the variable has to be set in the inventory or group_vars. The role is written to be safe to run twice: the creates: guard on kubeadm init means a second run never re-initialises a live cluster.

---
# Initialise the control plane, lay down kubeconfig, install Calico, and
# publish a join command the workers will pick up.

- name: Check whether the control plane is already initialised
  ansible.builtin.stat:
    path: /etc/kubernetes/admin.conf
  register: kubeadm_admin

- name: Pull control-plane images ahead of init
  ansible.builtin.command: kubeadm config images pull
  when: not kubeadm_admin.stat.exists
  changed_when: true

- name: Initialise the cluster with kubeadm
  ansible.builtin.command: >
    kubeadm init
    --pod-network-cidr={{ pod_network_cidr }}
    --apiserver-advertise-address={{ ansible_host }}
    --node-name={{ inventory_hostname }}
  args:
    creates: /etc/kubernetes/admin.conf
  register: kubeadm_init

- name: Create .kube directory for the login user
  ansible.builtin.file:
    path: "/home/{{ ansible_user }}/.kube"
    state: directory
    owner: "{{ ansible_user }}"
    group: "{{ ansible_user }}"
    mode: "0750"

- name: Install kubeconfig for the login user
  ansible.builtin.copy:
    src: /etc/kubernetes/admin.conf
    dest: "/home/{{ ansible_user }}/.kube/config"
    remote_src: true
    owner: "{{ ansible_user }}"
    group: "{{ ansible_user }}"
    mode: "0600"

- name: Detect the latest Calico release
  ansible.builtin.uri:
    url: https://api.github.com/repos/projectcalico/calico/releases/latest
    return_content: true
  register: calico_release

- name: Set the Calico version fact
  ansible.builtin.set_fact:
    calico_version: "{{ calico_release.json.tag_name }}"

- name: Install the Calico operator CRDs
  ansible.builtin.command: >
    kubectl --kubeconfig /etc/kubernetes/admin.conf apply --server-side --force-conflicts
    -f https://raw.githubusercontent.com/projectcalico/calico/{{ calico_version }}/manifests/operator-crds.yaml
  register: crds_apply
  changed_when: "'created' in crds_apply.stdout or 'configured' in crds_apply.stdout"

- name: Install the Tigera (Calico) operator
  ansible.builtin.command: >
    kubectl --kubeconfig /etc/kubernetes/admin.conf apply --server-side --force-conflicts
    -f https://raw.githubusercontent.com/projectcalico/calico/{{ calico_version }}/manifests/tigera-operator.yaml
  register: tigera_apply
  changed_when: "'created' in tigera_apply.stdout or 'configured' in tigera_apply.stdout"

- name: Wait for the Installation CRD to register
  ansible.builtin.command: >
    kubectl --kubeconfig /etc/kubernetes/admin.conf wait --for condition=established --timeout=90s
    crd/installations.operator.tigera.io
  changed_when: false

- name: Render the Calico Installation manifest
  ansible.builtin.template:
    src: calico-custom-resources.yaml.j2
    dest: /root/calico-custom-resources.yaml
    mode: "0644"

- name: Apply the Calico Installation
  ansible.builtin.command: >
    kubectl --kubeconfig /etc/kubernetes/admin.conf apply
    -f /root/calico-custom-resources.yaml
  register: calico_install
  changed_when: "'created' in calico_install.stdout or 'configured' in calico_install.stdout"

- name: Generate a worker join command
  ansible.builtin.command: kubeadm token create --print-join-command
  register: join_cmd
  changed_when: false

- name: Stash the join command for the worker play
  ansible.builtin.set_fact:
    kubeadm_join_command: "{{ join_cmd.stdout }}"

- name: Fetch the admin kubeconfig to the Ansible controller
  ansible.builtin.fetch:
    src: /etc/kubernetes/admin.conf
    dest: "{{ playbook_dir }}/admin.conf"
    flat: true

Calico ships as two pieces now. The operator CRDs go on first, then the operator itself, and only then does the Installation resource make sense to the API. Apply them out of order and you get error: ... installations.operator.tigera.io not found, which is exactly the trap the explicit CRD step and the kubectl wait avoid. The Installation manifest is a short template so the pod CIDR stays in one place:

apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
      - name: default-ipv4-ippool
        blockSize: 26
        cidr: {{ pod_network_cidr }}
        encapsulation: VXLANCrossSubnet
        natOutgoing: Enabled
        nodeSelector: all()
---
apiVersion: operator.tigera.io/v1
kind: APIServer
metadata:
  name: default
spec: {}

The operator reconciles that Installation into a running Calico deployment a few seconds after the API server comes up, and the pod CIDR matches the one kubeadm was handed.
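
Rather than assuming those few seconds have passed, the controller can wait on the operator’s own status. This is a sketch under two assumptions: that the operator publishes a TigeraStatus object named calico with an Available condition (confirm with kubectl get tigerastatus on your version), and that admin.conf has already been fetched to ~/ansible-k8s, the path the management playbooks use.

```yaml
---
- name: Wait for Calico to settle
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
  tasks:
    - name: Block until the operator reports Calico as Available
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        api_version: operator.tigera.io/v1
        kind: TigeraStatus
        name: calico
        wait: true
        wait_condition:
          type: Available
          status: "True"
        # Arbitrary ceiling; image pulls on a slow link may need longer.
        wait_timeout: 300
```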

Join the workers

The worker role is short. It checks whether the node already belongs to a cluster, and if not, runs the join command the control-plane play stashed in a host fact. The stat guard is what makes re-runs cheap: a node that already joined is skipped, not rejoined.

---
- name: Check whether this node already joined the cluster
  ansible.builtin.stat:
    path: /etc/kubernetes/kubelet.conf
  register: kubelet_conf

- name: Join the node to the cluster
  ansible.builtin.command: "{{ hostvars[groups['control_plane'][0]]['kubeadm_join_command'] }}"
  when: not kubelet_conf.stat.exists
  changed_when: true

That is the whole worker role. The join runs at most once per node, which is what makes growing the cluster later a no-op for the nodes already in it.

Run the bootstrap

One playbook ties the three roles together in order: prepare every node, build the control plane, join the workers, then wait for the whole cluster to report Ready.

---
- name: Prepare every node for Kubernetes
  hosts: k8s_cluster
  become: true
  roles:
    - common

- name: Bring up the control plane
  hosts: control_plane
  become: true
  roles:
    - control_plane

- name: Join the worker nodes
  hosts: workers
  become: true
  roles:
    - worker

- name: Wait for all nodes to report Ready
  hosts: control_plane
  become: true
  tasks:
    - name: Wait for nodes to be Ready
      ansible.builtin.command: >
        kubectl --kubeconfig /etc/kubernetes/admin.conf
        wait --for=condition=Ready nodes --all --timeout=180s
      register: nodes_ready
      changed_when: false

    - name: Show the cluster
      ansible.builtin.command: kubectl --kubeconfig /etc/kubernetes/admin.conf get nodes -o wide
      register: get_nodes
      changed_when: false

    - name: Cluster nodes
      ansible.builtin.debug:
        var: get_nodes.stdout_lines

Run it from the controller:

ansible-playbook bootstrap.yml

The first run pulls the control-plane images and takes a few minutes; later runs are quick. When it finishes, every node is Ready and running the same Kubernetes version.

Figure: Ansible playbook output showing three Kubernetes 1.36 nodes in Ready state

That is a working cluster, built from four blank Ubuntu installs, with no manual SSH into any node. If you would rather understand the kubeadm steps by hand before automating them, the kubeadm install walkthrough covers the same flow one command at a time.
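
Before moving on to the management playbooks, it is worth confirming that the controller can reach the API with the fetched kubeconfig, since everything from here on depends on that path rather than SSH. A small sketch, assuming the repository lives in ~/ansible-k8s as the later playbooks do:

```yaml
---
- name: Smoke-test API access from the controller
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
  tasks:
    - name: List nodes through the API
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        kind: Node
      register: cluster_nodes

    - name: Show each node and its kubelet version
      ansible.builtin.debug:
        msg: "{{ item.metadata.name }} runs kubelet {{ item.status.nodeInfo.kubeletVersion }}"
      loop: "{{ cluster_nodes.resources }}"
      loop_control:
        label: "{{ item.metadata.name }}"
```

If this play fails with an import or connection error, fix that first; the same error would otherwise surface in every kubernetes.core task.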

Manage workloads with kubernetes.core

From here the job changes. The kubernetes.core.k8s module sends manifests to the API server and reconciles them, the same way kubectl apply does, except it lives in a playbook you can template, loop, and gate on conditions. The playbook below creates a namespace, a ConfigMap, a Secret, a three-replica Deployment that consumes both, and a NodePort Service, then waits until the Deployment reports Available.

---
# Manage workloads on the cluster with the kubernetes.core collection.
# Runs on the Ansible controller and talks to the API server over the kubeconfig.
- name: Deploy a demo web app with Ansible
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
    app_namespace: demo
  tasks:
    - name: Create the namespace
      kubernetes.core.k8s:
        kubeconfig: "{{ kubeconfig }}"
        api_version: v1
        kind: Namespace
        name: "{{ app_namespace }}"
        state: present

    - name: Publish the landing page as a ConfigMap
      kubernetes.core.k8s:
        kubeconfig: "{{ kubeconfig }}"
        state: present
        definition:
          apiVersion: v1
          kind: ConfigMap
          metadata:
            name: web-content
            namespace: "{{ app_namespace }}"
          data:
            index.html: |
              <h1>Deployed by Ansible</h1>
              <p>nginx on Kubernetes, managed end to end with kubernetes.core.</p>

    - name: Store an app secret
      kubernetes.core.k8s:
        kubeconfig: "{{ kubeconfig }}"
        state: present
        definition:
          apiVersion: v1
          kind: Secret
          metadata:
            name: web-secret
            namespace: "{{ app_namespace }}"
          type: Opaque
          stringData:
            api-key: rotate-me-in-vault

    - name: Deploy the web application
      kubernetes.core.k8s:
        kubeconfig: "{{ kubeconfig }}"
        state: present
        definition:
          apiVersion: apps/v1
          kind: Deployment
          metadata:
            name: web
            namespace: "{{ app_namespace }}"
            labels:
              app: web
          spec:
            replicas: 3
            selector:
              matchLabels:
                app: web
            template:
              metadata:
                labels:
                  app: web
              spec:
                containers:
                  - name: nginx
                    image: nginx:1.27
                    ports:
                      - containerPort: 80
                    volumeMounts:
                      - name: content
                        mountPath: /usr/share/nginx/html
                    env:
                      - name: API_KEY
                        valueFrom:
                          secretKeyRef:
                            name: web-secret
                            key: api-key
                volumes:
                  - name: content
                    configMap:
                      name: web-content

    - name: Expose the app on a NodePort
      kubernetes.core.k8s:
        kubeconfig: "{{ kubeconfig }}"
        state: present
        definition:
          apiVersion: v1
          kind: Service
          metadata:
            name: web
            namespace: "{{ app_namespace }}"
          spec:
            type: NodePort
            selector:
              app: web
            ports:
              - port: 80
                targetPort: 80
                nodePort: 30080

    - name: Wait until the deployment is Available
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        api_version: apps/v1
        kind: Deployment
        name: web
        namespace: "{{ app_namespace }}"
        wait: true
        wait_condition:
          type: Available
          status: "True"
        wait_timeout: 150

    - name: List the running pods
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        kind: Pod
        namespace: "{{ app_namespace }}"
        label_selectors:
          - app=web
      register: web_pods

    - name: Show pod names and the nodes they landed on
      ansible.builtin.debug:
        msg: "{{ web_pods.resources | map(attribute='metadata.name') | zip(web_pods.resources | map(attribute='spec.nodeName')) | list }}"

Notice it runs against localhost with connection: local. These tasks never SSH anywhere; they reach the API over the kubeconfig that bootstrap.yml fetched to the controller. The kubernetes.core.k8s_info calls at the end give you read access in the same language, with a wait_condition that blocks until the Deployment reports Available instead of guessing with a sleep.

ansible-playbook manage/01-deploy-app.yml

The pods land across the workers, and the ConfigMap content is served on the NodePort.

Figure: Ansible deploying an nginx Deployment, Service and pods on Kubernetes with kubernetes.core
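
To confirm from the controller that the NodePort really serves the ConfigMap content, a follow-up play can fetch the page and assert on it. This is a sketch: it picks k8s-w1 from the inventory arbitrarily (a NodePort answers on any node) and assumes the controller can reach that address on port 30080.

```yaml
---
- name: Check the demo app through its NodePort
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Fetch the landing page from a worker
      ansible.builtin.uri:
        url: "http://{{ hostvars['k8s-w1']['ansible_host'] }}:30080/"
        return_content: true
      register: landing_page

    - name: Confirm nginx serves the ConfigMap content
      ansible.builtin.assert:
        that:
          - "'Deployed by Ansible' in landing_page.content"
```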

Because the module reconciles state, re-running the playbook after an edit changes only what differs. Bump replicas to 5 and run it again and the three existing pods stay; Kubernetes adds two. That is the declarative model the kubernetes.core documentation builds on, and it is why this beats a pile of shell calls to kubectl.

Install a Helm chart with Ansible

Most real clusters run Helm charts, and Ansible drives Helm without dropping to the shell. The kubernetes.core.helm and helm_repository modules add a repo and install or upgrade a release. metrics-server is a good first one: it is what kubectl top needs, and a kubeadm cluster does not ship it.

---
# Install a Helm chart with Ansible. metrics-server powers `kubectl top`.
- name: Install metrics-server with Helm
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
  tasks:
    - name: Add the metrics-server Helm repository
      kubernetes.core.helm_repository:
        name: metrics-server
        repo_url: https://kubernetes-sigs.github.io/metrics-server/

    - name: Install or upgrade the metrics-server release
      kubernetes.core.helm:
        kubeconfig: "{{ kubeconfig }}"
        name: metrics-server
        chart_ref: metrics-server/metrics-server
        release_namespace: kube-system
        state: present
        update_repo_cache: true
        # kubeadm issues self-signed kubelet certs, so skip TLS verification to it.
        values:
          args:
            - --kubelet-insecure-tls

    - name: Wait for the metrics-server rollout
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        api_version: apps/v1
        kind: Deployment
        name: metrics-server
        namespace: kube-system
        wait: true
        wait_condition:
          type: Available
          status: "True"
        wait_timeout: 150

The --kubelet-insecure-tls argument is the gotcha. kubeadm gives each kubelet a self-signed serving certificate, and metrics-server refuses to scrape it unless you tell it to skip that verification. Without the flag the pod runs but kubectl top answers “Metrics API not available” forever. Install it, give it a scrape cycle, and node metrics appear.

ansible-playbook manage/02-helm-metrics-server.yml
kubectl top nodes

Node CPU and memory now report through the metrics API:

Figure: Ansible Helm playbook installing metrics-server with kubectl top nodes output

That is the case for running Helm through Ansible rather than by hand: the release is declared in a playbook you can re-run, template, and keep in version control next to everything else.
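
The wait for the first scrape can be made explicit as well. metrics-server registers an aggregated API, and Kubernetes tracks its health on an APIService object, so one option is to block on that object’s Available condition. A sketch that could be appended to the playbook above as a separate play:

```yaml
---
- name: Wait for the metrics API to become available
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
  tasks:
    - name: Block until the aggregated metrics API is served
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        api_version: apiregistration.k8s.io/v1
        kind: APIService
        name: v1beta1.metrics.k8s.io
        wait: true
        wait_condition:
          type: Available
          status: "True"
        wait_timeout: 120
```

Once this returns, kubectl top nodes should answer with numbers instead of the “Metrics API not available” error.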

Drain a node and roll a deployment

Patching a node means moving its pods elsewhere first. kubernetes.core.k8s_drain cordons and drains in one task, and the matching uncordon brings the node back. The playbook drains a worker, confirms it is unschedulable, returns it to service, then triggers a rolling restart of the web Deployment by stamping a fresh annotation on the pod template.

---
# Day-2 operations: drain a node for maintenance, bring it back, roll a deployment.
- name: Node maintenance and a rolling restart with Ansible
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
    target_node: k8s-w2
    app_namespace: demo
  tasks:
    - name: Cordon and drain the node
      kubernetes.core.k8s_drain:
        kubeconfig: "{{ kubeconfig }}"
        name: "{{ target_node }}"
        state: drain
        delete_options:
          ignore_daemonsets: true
          delete_emptydir_data: true
          terminate_grace_period: 30
          wait_timeout: 120

    - name: Confirm the node is unschedulable
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        kind: Node
        name: "{{ target_node }}"
      register: drained_node

    - name: Node scheduling state
      ansible.builtin.debug:
        msg: "{{ target_node }} unschedulable = {{ drained_node.resources[0].spec.unschedulable | default(false) }}"

    # Real maintenance (kernel patch, reboot) would happen here.

    - name: Bring the node back into the scheduler
      kubernetes.core.k8s_drain:
        kubeconfig: "{{ kubeconfig }}"
        name: "{{ target_node }}"
        state: uncordon

    - name: Trigger a rolling restart of the web deployment
      kubernetes.core.k8s:
        kubeconfig: "{{ kubeconfig }}"
        state: patched
        api_version: apps/v1
        kind: Deployment
        name: web
        namespace: "{{ app_namespace }}"
        definition:
          spec:
            template:
              metadata:
                annotations:
                  ansible.computingforgeeks.com/restartedAt: "{{ now(utc=true).isoformat() }}"

    - name: Wait for the rollout to finish
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        api_version: apps/v1
        kind: Deployment
        name: web
        namespace: "{{ app_namespace }}"
        wait: true
        wait_condition:
          type: Available
          status: "True"
        wait_timeout: 150

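Wrap the drain and uncordon around a real maintenance task and you have a repeatable patch window: drain, reboot the node with the reboot module, wait for it, uncordon. The rolling-restart trick at the end is the same one kubectl rollout restart uses under the surface, expressed as a patch. One caveat: a Deployment normally stays Available all the way through a rolling restart, so the final wait task confirms the app is still serving rather than that every pod has been replaced.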
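
To wait for the rollout itself, one option is to poll the Deployment’s status counters until they all match spec.replicas, which is roughly the comparison kubectl rollout status makes. The play below is a sketch that could stand in for the last task of the playbook above; the retry count and delay are arbitrary choices.

```yaml
---
- name: Wait for a rolling restart to complete
  hosts: localhost
  connection: local
  gather_facts: false
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
    app_namespace: demo
  tasks:
    - name: Poll until every replica runs the new pod template
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        api_version: apps/v1
        kind: Deployment
        name: web
        namespace: "{{ app_namespace }}"
      register: web_deploy
      # All conditions must hold: the controller has seen the patch, no old
      # pods remain, and every updated pod is available.
      until:
        - web_deploy.resources | length == 1
        - web_deploy.resources[0].status.observedGeneration | default(0) == web_deploy.resources[0].metadata.generation
        - web_deploy.resources[0].status.updatedReplicas | default(0) == web_deploy.resources[0].spec.replicas
        - web_deploy.resources[0].status.replicas | default(0) == web_deploy.resources[0].spec.replicas
        - web_deploy.resources[0].status.availableReplicas | default(0) == web_deploy.resources[0].spec.replicas
      retries: 30
      delay: 5
```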

ansible-playbook manage/03-day2-operations.yml

The node leaves and rejoins the scheduler, and the Deployment rolls one pod at a time:

Figure: Ansible day-2 playbook draining and uncordoning a Kubernetes worker node

The k8s_drain module handles the eviction along with the daemonset and emptydir edge cases that a hand-rolled wrapper around kubectl drain usually forgets.
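
The patch window described earlier (drain, reboot, wait, uncordon) might look like the sketch below. It targets the node over SSH for the package and reboot steps and delegates the API calls to the controller. Two assumptions: the Kubernetes node name equals the inventory hostname, as the target_node value in the playbook above suggests, and a dist-upgrade is the maintenance you want. The timeouts are arbitrary.

```yaml
---
- name: Patch window for one worker
  hosts: k8s-w2
  become: true
  gather_facts: false
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
  tasks:
    - name: Cordon and drain the node through the API
      kubernetes.core.k8s_drain:
        kubeconfig: "{{ kubeconfig }}"
        name: "{{ inventory_hostname }}"
        state: drain
        delete_options:
          ignore_daemonsets: true
          delete_emptydir_data: true
          wait_timeout: 120
      delegate_to: localhost
      become: false

    - name: Apply pending package updates (held Kubernetes packages stay put)
      ansible.builtin.apt:
        upgrade: dist
        update_cache: true

    - name: Reboot and wait for SSH to return
      ansible.builtin.reboot:
        reboot_timeout: 600

    - name: Wait for the kubelet to report Ready
      kubernetes.core.k8s_info:
        kubeconfig: "{{ kubeconfig }}"
        kind: Node
        name: "{{ inventory_hostname }}"
        wait: true
        wait_condition:
          type: Ready
          status: "True"
        wait_timeout: 300
      delegate_to: localhost
      become: false

    - name: Return the node to the scheduler
      kubernetes.core.k8s_drain:
        kubeconfig: "{{ kubeconfig }}"
        name: "{{ inventory_hostname }}"
        state: uncordon
      delegate_to: localhost
      become: false
```

To patch several workers, point hosts at the workers group and add serial: 1 so only one node is out of the scheduler at a time.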

Add a worker node

This is where the idempotent roles pay off. To grow the cluster, add the new node under [workers] in the inventory and run the same bootstrap playbook. Nothing else changes.

[workers]
k8s-w1 ansible_host=192.168.1.169
k8s-w2 ansible_host=192.168.1.170
k8s-w3 ansible_host=192.168.1.157

Then re-run the bootstrap, unchanged:

ansible-playbook bootstrap.yml

The existing nodes report changed=0 because their state already matches. Only the new node runs the prep tasks and the join, and it is Ready in under a minute.

Figure: Ansible re-run joining a new worker with kubectl get nodes showing four Ready nodes

The same loop scales the other way for cloud fleets: instead of editing the inventory by hand, pull the node list from your provider with dynamic inventory and let the count drive itself.

Troubleshooting

Failed to import the required Python library (kubernetes)

The kubernetes.core modules cannot find the Python client. It is almost always installed into the wrong environment. If you installed Ansible with pipx, run pipx inject ansible kubernetes so the client lands in Ansible’s venv. With a system Ansible, install python3-kubernetes from apt instead.

error: … installations.operator.tigera.io not found

Calico’s Installation resource was applied before its CRD existed. Recent Calico splits the CRDs into operator-crds.yaml, which has to go on before tigera-operator.yaml. The role applies them in that order and then runs kubectl wait --for condition=established on the CRD, so the race cannot happen.

Pods stuck in ContainerCreating, nodes never Ready

Two usual causes. Either containerd is still on the cgroupfs driver (check that SystemdCgroup = true is set in /etc/containerd/config.toml and containerd was restarted), or the pod CIDR overlaps your LAN and the CNI cannot route. Confirm the value in group_vars/all.yml is a range your network does not use.

kubectl top says “Metrics API not available”

metrics-server is running but cannot scrape the kubelets. On a kubeadm cluster it needs --kubelet-insecure-tls, set through Helm values as shown above. Give it thirty seconds after the rollout for the first scrape before deciding it is broken.

Take it to production

The cluster here has one control-plane node, which is fine for a lab and wrong for anything you depend on. The same roles extend in a few clear steps. Run three control-plane nodes behind a load balancer and pass --control-plane-endpoint to kubeadm init so the API has a stable address to fail over to. Keep the Kubernetes minor pinned in group_vars and bump it deliberately, one minor at a time, rather than letting apt decide. Most of all, stop storing Secrets as plain text in a playbook: the stringData field in the deploy example is readable to anyone with the repo, so move those values behind Ansible Vault and reference them as variables. The full set of roles and playbooks, ready to clone, lives in the companion repository.
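
As a sketch of the Vault step, the Secret task from the deploy example can read its value from an encrypted vars file. The file name vault.yml and the variable vault_web_api_key are hypothetical; create the file with ansible-vault create vault.yml and run the playbook with --ask-vault-pass or a vault password file.

```yaml
---
- name: Store an app secret without plain text in the repo
  hosts: localhost
  connection: local
  gather_facts: false
  vars_files:
    # Hypothetical file, encrypted with ansible-vault; defines vault_web_api_key.
    - vault.yml
  vars:
    kubeconfig: "{{ lookup('env', 'HOME') }}/ansible-k8s/admin.conf"
    app_namespace: demo
  tasks:
    - name: Store an app secret
      kubernetes.core.k8s:
        kubeconfig: "{{ kubeconfig }}"
        state: present
        definition:
          apiVersion: v1
          kind: Secret
          metadata:
            name: web-secret
            namespace: "{{ app_namespace }}"
          type: Opaque
          stringData:
            api-key: "{{ vault_web_api_key }}"
      # Keep the decrypted value out of task output and logs.
      no_log: true
```

The repository then holds only ciphertext, and the Secret in the cluster is still created by the same declarative task.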