Using AI Coding Agents for DevOps: Terraform, Ansible, and Kubernetes with OpenCode

AI coding agents aren’t just for web developers cranking out React components. If you spend your days writing Terraform modules, Ansible playbooks, Kubernetes manifests, and bash scripts, these tools fit right into your workflow. OpenCode, paired with the Oh-My-OpenAgent plugin, turns a terminal into a multi-agent system that can generate, review, and refactor infrastructure code across any LLM provider you configure.

Original content from computingforgeeks.com - post 165425

This guide walks through practical, real-world examples of using OpenCode to produce DevOps infrastructure code. We cover Terraform, Ansible, Kubernetes YAML, and shell scripting, with honest assessments of what the tool gets right and where you still need to apply your own judgment. The install guide for OpenCode and Oh-My-OpenAgent on Linux covers the setup process if you haven’t done that yet.

Tested April 2026 with OpenCode 1.4.0, Oh-My-OpenAgent 2.1.0 on Rocky Linux 9.5

What You’ll Learn

  • How to generate production-quality Terraform modules with OpenCode prompts
  • Generating and reviewing Ansible playbooks with SELinux and handler considerations
  • Creating Kubernetes deployments, services, and ingress manifests from natural language
  • Writing shell scripts with proper error handling and logging
  • Using Oh-My-OpenAgent’s multi-agent mode to tackle complex infrastructure tasks
  • What AI agents consistently get wrong with infrastructure code (and how to catch it)

Prerequisites

  • OpenCode installed and configured with an API key (see the installation guide)
  • Oh-My-OpenAgent plugin installed for multi-agent orchestration
  • Basic familiarity with Terraform, Ansible, Kubernetes, and shell scripting
  • Tested on: Rocky Linux 9.5, OpenCode 1.4.0, Oh-My-OpenAgent 2.1.0
  • At least one LLM provider configured (Claude, GPT-4o, DeepSeek, or a local model via Ollama)

How AI Agents Handle Infrastructure Code

AI coding agents work best when they have clear constraints: a specific cloud provider, a target OS, a defined architecture. Vague prompts like “set up my infrastructure” produce vague results. Specific prompts like “create a Terraform module for an AWS VPC with two public subnets, two private subnets, and a NAT gateway in us-east-1” produce code you can actually use.

What works well: generating boilerplate (variable definitions, output blocks, resource scaffolding), following established patterns (three-tier architectures, standard Kubernetes deployment specs), and producing syntactically correct HCL, YAML, and bash. Where it falls short: understanding your specific environment’s constraints, getting provider version pinning right, and handling complex state dependencies. Treat AI-generated infrastructure code the same way you’d treat a pull request from a junior engineer. Review everything, run terraform plan, test in staging.

Generating Terraform Code with OpenCode

Start OpenCode in your Terraform project directory. The agent works best when it can see your existing module structure and provider configuration.

cd ~/terraform-projects/aws-infrastructure

opencode

Once inside the OpenCode TUI, give it a specific prompt for the VPC module:

opencode run "Create a Terraform module for an AWS VPC with 2 public subnets, 2 private subnets, a NAT gateway, and outputs for all subnet IDs. Use variables for CIDR ranges and region. Target Terraform 1.9+ and AWS provider 5.x."

OpenCode generates the module as separate files. Here’s what a typical variables.tf looks like:

variable "vpc_cidr" {
  description = "CIDR block for the VPC"
  type        = string
  default     = "10.0.0.0/16"
}

variable "region" {
  description = "AWS region"
  type        = string
  default     = "us-east-1"
}

variable "public_subnet_cidrs" {
  description = "CIDR blocks for public subnets"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24"]
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets"
  type        = list(string)
  default     = ["10.0.10.0/24", "10.0.11.0/24"]
}

variable "environment" {
  description = "Environment name for tagging"
  type        = string
  default     = "production"
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
  default     = ["us-east-1a", "us-east-1b"]
}

The generated main.tf creates the VPC, subnets, internet gateway, NAT gateway, and route tables:

terraform {
  required_version = ">= 1.9.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_support   = true
  enable_dns_hostnames = true

  tags = {
    Name        = "${var.environment}-vpc"
    Environment = var.environment
    ManagedBy   = "terraform"
  }
}

resource "aws_internet_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "${var.environment}-igw"
  }
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnet_cidrs)
  vpc_id                  = aws_vpc.main.id
  cidr_block              = var.public_subnet_cidrs[count.index]
  availability_zone       = var.availability_zones[count.index]
  map_public_ip_on_launch = true

  tags = {
    Name = "${var.environment}-public-${count.index + 1}"
    Tier = "public"
  }
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = {
    Name = "${var.environment}-private-${count.index + 1}"
    Tier = "private"
  }
}

resource "aws_eip" "nat" {
  domain = "vpc"

  tags = {
    Name = "${var.environment}-nat-eip"
  }
}

resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id

  tags = {
    Name = "${var.environment}-nat-gw"
  }

  depends_on = [aws_internet_gateway.main]
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }

  tags = {
    Name = "${var.environment}-public-rt"
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.main.id
  }

  tags = {
    Name = "${var.environment}-private-rt"
  }
}

resource "aws_route_table_association" "public" {
  count          = length(var.public_subnet_cidrs)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = length(var.private_subnet_cidrs)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private.id
}

And the outputs.tf to expose the IDs:

output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "public_subnet_ids" {
  description = "IDs of public subnets"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "IDs of private subnets"
  value       = aws_subnet.private[*].id
}

output "nat_gateway_id" {
  description = "ID of the NAT Gateway"
  value       = aws_nat_gateway.main.id
}

Reviewing the Generated Terraform

The agent produced clean, well-structured code. A few things it got right: proper tagging, depends_on for the NAT gateway, DNS support enabled on the VPC, and separate route tables for public and private subnets. Things to verify before running terraform apply:

  • Provider version: the ~> 5.0 constraint is broad. Pin it tighter in production, something like ~> 5.82 based on what you’re actually running
  • CIDR ranges: make sure they don’t overlap with existing VPCs or on-premises networks
  • Single NAT gateway: fine for dev, but production workloads need one per AZ for high availability
  • No VPC flow logs: you’ll want those for compliance and debugging
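
The CIDR overlap check in particular is easy to script rather than eyeball. A minimal sketch, assuming python3 is on the PATH (the function name and example ranges are illustrative, not from the generated module):

```shell
# overlaps A B -> exits 0 if the two CIDR blocks overlap, 1 otherwise
# (delegates the math to Python's stdlib ipaddress module)
overlaps() {
    python3 -c '
import ipaddress, sys
a = ipaddress.ip_network(sys.argv[1], strict=False)
b = ipaddress.ip_network(sys.argv[2], strict=False)
sys.exit(0 if a.overlaps(b) else 1)
' "$1" "$2"
}

overlaps 10.0.0.0/16 10.0.1.0/24 && echo "overlap"        # subnet inside the VPC range
overlaps 10.0.0.0/16 192.168.1.0/24 || echo "no overlap"  # disjoint on-prem range
```

Run it against every existing VPC and on-premises range before accepting AI-suggested CIDRs.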

Iterating on the Module

The real power shows up when you iterate. Ask OpenCode to extend the module:

opencode run "Add a bastion host in the first public subnet with a security group allowing SSH from a configurable CIDR. Use Amazon Linux 2023 AMI. Include a variable for the SSH key name."

OpenCode reads the existing module files, understands the VPC structure, and adds the bastion resources that reference the subnets already defined. This context awareness is what separates an AI agent from a simple code snippet generator.

Always validate iterative changes:

terraform fmt -recursive
terraform validate
terraform plan

The plan output shows exactly what resources will be created, modified, or destroyed before you commit anything.

Generating Ansible Playbooks

Ansible playbook generation is where AI agents shine because playbooks follow predictable patterns: install packages, template configs, manage services, set firewall rules. Give OpenCode a clear target:

opencode run "Write an Ansible playbook that installs Nginx on Rocky Linux 9, configures a reverse proxy to a backend on port 8080, opens firewall ports 80 and 443, sets up Let's Encrypt SSL with certbot, and handles SELinux booleans. Use handlers for service restarts."

Here’s the kind of playbook OpenCode produces:

---
- name: Configure Nginx reverse proxy with SSL
  hosts: webservers
  become: true
  vars:
    backend_port: 8080
    domain_name: app.example.com
    certbot_email: admin@example.com

  tasks:
    - name: Install Nginx and certbot
      ansible.builtin.dnf:
        name:
          - nginx
          - certbot
          - python3-certbot-nginx
        state: present

    - name: Start and enable Nginx
      ansible.builtin.systemd:
        name: nginx
        state: started
        enabled: true

    - name: Allow SELinux httpd network connections
      ansible.posix.seboolean:
        name: httpd_can_network_connect
        state: true
        persistent: true

    - name: Open HTTP and HTTPS in firewalld
      ansible.posix.firewalld:
        service: "{{ item }}"
        permanent: true
        immediate: true
        state: enabled
      loop:
        - http
        - https

    - name: Deploy Nginx reverse proxy config
      ansible.builtin.template:
        src: templates/nginx-reverse-proxy.conf.j2
        dest: /etc/nginx/conf.d/{{ domain_name }}.conf
        owner: root
        group: root
        mode: "0644"
      notify: Reload Nginx

    - name: Remove default server block
      ansible.builtin.file:
        path: /etc/nginx/conf.d/default.conf
        state: absent
      notify: Reload Nginx

    - name: Obtain SSL certificate
      ansible.builtin.command:
        cmd: >
          certbot certonly --nginx
          -d {{ domain_name }}
          --non-interactive
          --agree-tos
          -m {{ certbot_email }}
        creates: /etc/letsencrypt/live/{{ domain_name }}/fullchain.pem

    - name: Deploy SSL Nginx config
      ansible.builtin.template:
        src: templates/nginx-ssl-proxy.conf.j2
        dest: /etc/nginx/conf.d/{{ domain_name }}.conf
        owner: root
        group: root
        mode: "0644"
      notify: Reload Nginx

    - name: Set up certbot renewal cron
      ansible.builtin.cron:
        name: "certbot renewal"
        job: "certbot renew --quiet --post-hook 'systemctl reload nginx'"
        hour: "3"
        minute: "30"
        weekday: "1"

  handlers:
    - name: Reload Nginx
      ansible.builtin.systemd:
        name: nginx
        state: reloaded

What the Agent Got Right

The playbook uses FQCNs (fully qualified collection names) like ansible.builtin.dnf instead of the short form. It includes the httpd_can_network_connect SELinux boolean, which is essential for Nginx to proxy to a backend port on RHEL-family systems. Handlers are used correctly to reload Nginx only when configuration changes, not on every run. The creates parameter on the certbot task makes it idempotent.

What to Fix Before Running

A few adjustments for production use:

  • The playbook assumes the domain already resolves to the server. Add a verification task or document that as a prerequisite
  • The Jinja2 templates (nginx-reverse-proxy.conf.j2 and nginx-ssl-proxy.conf.j2) are referenced but not generated. Ask OpenCode for those separately
  • The certbot task uses --nginx plugin, which works well, but some teams prefer --standalone or --webroot for more control
  • Consider adding a nginx -t validation task before reloading to catch syntax errors in generated configs

Use the Oracle agent in Oh-My-OpenAgent for an architecture review:

opencode run "@oracle Review this Ansible playbook for production readiness. Check idempotency, error handling, and security."

The Oracle agent analyzes the playbook structure and flags issues like missing validate parameters on template tasks and the absence of a rollback strategy if certbot fails.
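
Idempotency is also cheap to smoke-test without an agent: run the playbook twice and assert the second run's PLAY RECAP reports changed=0 on every host. A minimal grep helper (the recap line below is a made-up example, not real output from this playbook):

```shell
# assert_idempotent LOGFILE -> succeeds only if no host reports a nonzero changed= count
assert_idempotent() {
    ! grep -E 'changed=[1-9]' "$1"
}

# example recap line in the format ansible-playbook prints (hypothetical host name)
printf 'web01 : ok=9 changed=0 unreachable=0 failed=0\n' > /tmp/recap.log
assert_idempotent /tmp/recap.log && echo "second run idempotent"
```

Pipe the second `ansible-playbook` run through `tee` into a log file and gate your pipeline on this check.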

Kubernetes Manifests

Kubernetes YAML is notoriously verbose, which makes it an ideal candidate for AI generation. Most deployments follow the same pattern: Deployment, Service, maybe an Ingress or HPA.

opencode run "Create Kubernetes manifests for a Python Flask app: a Deployment with 3 replicas, resource limits, health checks, and a non-root security context. Add a ClusterIP Service and an Ingress with TLS. Use the image registry.example.com/flask-app:1.2.0."

The generated manifests cover all three resources. Here’s the Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app
  labels:
    app: flask-app
    version: "1.2.0"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: flask-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: flask-app
        version: "1.2.0"
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: flask-app
          image: registry.example.com/flask-app:1.2.0
          ports:
            - containerPort: 5000
              protocol: TCP
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 256Mi
          livenessProbe:
            httpGet:
              path: /healthz
              port: 5000
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 5000
            initialDelaySeconds: 5
            periodSeconds: 10
          env:
            - name: FLASK_ENV
              value: "production"
            - name: PORT
              value: "5000"

The Service and Ingress:

---
apiVersion: v1
kind: Service
metadata:
  name: flask-app
  labels:
    app: flask-app
spec:
  type: ClusterIP
  selector:
    app: flask-app
  ports:
    - port: 80
      targetPort: 5000
      protocol: TCP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flask-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - flask.example.com
      secretName: flask-app-tls
  rules:
    - host: flask.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: flask-app
                port:
                  number: 80

Review Notes

Solid output overall. The security context with runAsNonRoot and a specific UID is a good practice many engineers skip. The rolling update strategy with maxUnavailable: 0 ensures zero downtime during deploys. A few things to check:

  • The health check endpoints (/healthz and /ready) must actually exist in your Flask app. If they don’t, the pods will crashloop
  • The Ingress assumes you have cert-manager and an nginx ingress controller installed. Adjust annotations for your specific setup (Traefik, AWS ALB, etc.)
  • Resource limits depend entirely on your application’s actual usage. Run load tests and adjust based on real metrics
  • No PodDisruptionBudget was generated, which you’ll want for production clusters

Preview what would change in your cluster before applying:

kubectl diff -f flask-app/

This shows a diff of what Kubernetes would modify without actually applying anything.
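
kubectl diff signals its result through the exit code (0 for no differences, 1 for differences found, greater than 1 for an error), so it slots neatly into a gate script. A hedged wrapper sketch (the function name is illustrative):

```shell
# diff_gate DIR -> report whether applying the manifests would change the cluster,
# based on kubectl diff's documented exit codes (0 = none, 1 = differences, >1 = error)
diff_gate() {
    local rc=0
    kubectl diff -f "$1" >/dev/null || rc=$?
    case $rc in
        0) echo "no drift" ;;
        1) echo "changes pending" ;;
        *) echo "kubectl diff failed" >&2; return 2 ;;
    esac
}
```

In CI, treat exit code 1 as "post the diff for review" rather than as a failure.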

Shell Scripts and Automation

Backup scripts, deployment wrappers, log rotation, health checks. Every DevOps team has a collection of shell scripts that grew organically over the years. AI agents produce scripts with better structure and error handling than most hand-written ones.

opencode run "Write a bash script that dumps all PostgreSQL databases, compresses each with zstd, uploads to an S3 bucket with a date prefix, removes local dumps older than 7 days, and logs everything to syslog. Include error handling and exit on failure."

The output is a well-structured script with proper error handling:

#!/usr/bin/env bash

set -euo pipefail

# Configuration
S3_BUCKET="s3://company-backups/postgresql"
BACKUP_DIR="/var/backups/postgresql"
RETENTION_DAYS=7
DATE_PREFIX=$(date +%Y-%m-%d)
LOG_TAG="pg-backup"
PG_USER="postgres"

log() {
    logger -t "$LOG_TAG" "$1"
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"
}

cleanup() {
    local exit_code=$?
    if [[ $exit_code -ne 0 ]]; then
        log "ERROR: Backup failed with exit code $exit_code"
    fi
    exit $exit_code
}

trap cleanup EXIT

mkdir -p "$BACKUP_DIR"

log "Starting PostgreSQL backup"

# Get list of databases, excluding templates
DATABASES=$(sudo -u "$PG_USER" psql -At -c \
    "SELECT datname FROM pg_database WHERE datistemplate = false AND datname != 'postgres';")

if [[ -z "$DATABASES" ]]; then
    log "WARNING: No databases found to back up"
    exit 0
fi

BACKUP_COUNT=0
FAILED_COUNT=0

for db in $DATABASES; do
    DUMP_FILE="$BACKUP_DIR/${db}_${DATE_PREFIX}.sql"
    COMPRESSED_FILE="${DUMP_FILE}.zst"

    log "Dumping database: $db"

    if sudo -u "$PG_USER" pg_dump --format=plain "$db" > "$DUMP_FILE" 2>/dev/null; then
        log "Compressing $db with zstd"
        zstd --rm -q -T0 "$DUMP_FILE"

        FILESIZE=$(stat --format=%s "$COMPRESSED_FILE" 2>/dev/null || stat -f%z "$COMPRESSED_FILE")
        log "Uploading $db ($FILESIZE bytes) to S3"

        if aws s3 cp "$COMPRESSED_FILE" "$S3_BUCKET/$DATE_PREFIX/${db}.sql.zst" --quiet; then
            log "Successfully uploaded $db to S3"
            BACKUP_COUNT=$((BACKUP_COUNT + 1))  # avoid ((x++)): it exits nonzero under set -e when x is 0
        else
            log "ERROR: Failed to upload $db to S3"
            FAILED_COUNT=$((FAILED_COUNT + 1))
        fi
    else
        log "ERROR: Failed to dump database $db"
        FAILED_COUNT=$((FAILED_COUNT + 1))
    fi
done

# Clean up old local backups
log "Removing local backups older than $RETENTION_DAYS days"
find "$BACKUP_DIR" -name "*.sql.zst" -mtime +$RETENTION_DAYS -delete

log "Backup complete: $BACKUP_COUNT succeeded, $FAILED_COUNT failed"

if [[ $FAILED_COUNT -gt 0 ]]; then
    exit 1
fi

Script Quality Assessment

This is production-ready with minor tweaks. The set -euo pipefail at the top catches errors early. The trap ensures cleanup logging happens even on failure. Dual logging to both syslog and stdout means you get output in cron emails and can search with journalctl -t pg-backup. The -T0 flag on zstd uses all available CPU cores for compression.

What to adjust: the pg_dump format is plain, which works for smaller databases. For anything over a few gigabytes, switch to --format=custom or --format=directory with --jobs for parallel dumping. Note that the database query also skips the postgres maintenance database, so include it if you store anything there. Also consider adding a .pgpass file or environment variable for authentication instead of relying on peer auth.
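
If you make that switch, remember that only the directory format supports parallel workers. A hedged sketch of the swap (the wrapper name, destination path, and job count are illustrative):

```shell
# parallel_dump DB DEST JOBS -> directory-format dump with parallel workers;
# restore the result with pg_restore --jobs for parallel restore as well
parallel_dump() {
    sudo -u postgres pg_dump --format=directory --jobs="$3" --file="$2" "$1"
}

# usage (hypothetical database name):
# parallel_dump appdb /var/backups/postgresql/appdb.dir 4
```

The directory output is a folder rather than a single file, so adjust the compression and S3 upload steps accordingly (for example, tar the directory before uploading).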

Using Multi-Agent Mode for Complex Tasks

Oh-My-OpenAgent introduces specialized agents that collaborate on larger tasks. When you invoke the ultrawork command, three agents coordinate: Prometheus (the planner), Sisyphus (the orchestrator), and Hephaestus (the executor).

opencode run "@ultrawork Set up a complete CI/CD pipeline with GitHub Actions that builds a Docker image, pushes to ECR, runs Trivy security scan, deploys to EKS staging with Helm, runs integration tests, and promotes to production on approval."

Prometheus breaks the task into discrete components: the Dockerfile, GitHub Actions workflow, Helm chart, and deployment scripts. Sisyphus determines the execution order and dependencies between them. Hephaestus generates each file.

The result is typically five or six files:

  • .github/workflows/deploy.yml with the full pipeline including build, scan, stage, test, and promote jobs
  • Dockerfile with multi-stage build and non-root user
  • helm/values-staging.yaml and helm/values-production.yaml
  • scripts/integration-test.sh for post-deploy verification

Multi-agent mode produces more cohesive results than generating each file independently because the planner ensures all pieces reference each other correctly. The GitHub Actions workflow references the exact Helm values files, the integration test script hits the correct staging URL, and the Docker image tag propagates through every step.

You can also use the Momus agent specifically for code review:

opencode run "@momus Review the generated CI/CD pipeline for security issues, missing error handling, and production readiness."

Momus typically catches things like mutable image tags (no digest pinning or ECR tag immutability enforced), long-lived access keys used where OIDC federation should handle ECR authentication, and missing timeout values on GitHub Actions jobs.

Best Practices for AI-Generated Infrastructure Code

After testing OpenCode extensively with DevOps workflows, these practices consistently prevent issues.

Always dry-run before applying. Every tool in the DevOps ecosystem has a preview mode. Use it.

terraform plan -out=tfplan
ansible-playbook site.yml --check --diff
kubectl diff -f manifests/
shellcheck backup-script.sh

These commands should become muscle memory after every AI-generated code session.

Pin versions explicitly. AI agents tend to use loose version constraints or skip pinning entirely. Lock down provider versions in Terraform, collection versions in Ansible, and image tags in Kubernetes. A latest tag in a Deployment manifest is a ticking time bomb.
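
A grep in CI catches the worst offender, the unpinned image tag. A minimal sketch (the pattern is deliberately simple; extend it to require digests if your policy demands them):

```shell
# scan_latest DIR -> list manifest lines that pin an image to :latest
scan_latest() {
    grep -RnE 'image:[[:space:]]*[^[:space:]]+:latest[[:space:]]*$' "$1"
}

# fail the pipeline when anything matches, e.g.:
# scan_latest manifests/ && { echo "unpinned :latest tags found" >&2; exit 1; }
```
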

Review IAM policies and RBAC carefully. AI agents err on the side of permissiveness because overly restrictive permissions break things during testing. An Action: "*" in a generated IAM policy is functional but violates least privilege. Narrow it down to the specific actions your workload needs.

Test in isolation first. Create a throwaway environment (a dedicated Terraform workspace, a Kind cluster, a Vagrant box) and deploy the AI-generated code there before touching staging or production.

Use the review agents. Oh-My-OpenAgent includes Momus for code review and Oracle for architecture review. Running both on generated code catches issues that a single pass misses, because each agent evaluates from a different perspective.

What AI Agents Get Wrong

Honesty about limitations matters more than hype. After months of using AI agents for infrastructure code, these are the patterns where they consistently need correction.

Outdated provider and module versions. AI models have training data cutoffs. The agent might generate Terraform code with AWS provider 4.x syntax when 5.x changed the API. Always check the provider changelog and run terraform init -upgrade to catch incompatibilities.

SELinux and AppArmor are an afterthought. Most generated playbooks and scripts assume permissive mode or ignore mandatory access controls entirely. On RHEL-family systems with SELinux enforcing (which is every properly configured production server), missing setsebool or semanage commands cause silent failures that are painful to debug. Always check ausearch -m avc -ts recent after deploying AI-generated configurations.

Generic security groups and firewall rules. AI agents often open wider ranges than necessary. A generated security group allowing 0.0.0.0/0 on port 22 is technically correct but terrible practice. Restrict source CIDRs to your bastion network or VPN ranges.

Complex state and dependencies. Terraform state management, Ansible inventory patterns for multi-tier deployments, and Kubernetes operators with CRDs are areas where AI-generated code needs significant human review. The agent can scaffold the structure, but the business logic of “deploy database before the app” or “drain node before upgrading” requires understanding your specific architecture.

Secrets in plain text. Generated code sometimes puts passwords or API keys directly in YAML files or shell scripts. Always move secrets to Vault, AWS Secrets Manager, Kubernetes Secrets (or better, External Secrets Operator), or encrypted Ansible Vault files.
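
Before committing generated files, a crude scan for obvious plaintext credentials is cheap insurance. A sketch (the keyword list is illustrative, not exhaustive; dedicated scanners like gitleaks do this properly):

```shell
# scan_secrets DIR -> flag lines that look like hardcoded credentials
scan_secrets() {
    grep -RniE '(password|passwd|api_key|secret_key|token)[[:space:]]*[:=]' "$1"
}
```

Expect false positives (Kubernetes Secret references will match too); the point is to force a human look before anything lands in git history.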

Frequently Asked Questions

Can AI agents replace DevOps engineers?

No. AI agents accelerate the parts of DevOps that are repetitive and pattern-based: writing boilerplate, scaffolding standard architectures, generating initial manifests. The judgment calls (which architecture to use, how to handle failure modes, what the security boundary should be) still require a human who understands the production environment. Think of AI agents as a faster way to get a first draft that you then refine.

Which LLM model works best for infrastructure code?

Claude Sonnet 4 and GPT-4o produce the most accurate Terraform and Kubernetes code in our testing. DeepSeek V3 is surprisingly good for Ansible playbooks and shell scripts, especially when running locally via Ollama for air-gapped environments. The model matters less than the specificity of your prompt. A detailed prompt with constraints produces better code on any model than a vague prompt on the best model.

Is AI-generated infrastructure code safe for production?

With review, yes. The generated code needs the same scrutiny you’d apply to any pull request: check for overly permissive IAM, validate resource limits, verify version pins, and test in staging. The tools exist to catch issues (terraform plan, ansible --check, kubectl diff, checkov, tfsec). Use them. Skip the review step and you’ll learn why the hard way.

Related Articles

  • How To Manage Kubernetes Cluster with Kubevious
  • Running Supabase Firebase Alternative in a Docker Container
  • How To Secure GitLab Server with SSL Certificate
  • Grafana Dashboards and Alerting on Kubernetes
