
Monitor GCP Cert Rotations with Cloud Monitoring Runbooks

A cert rotation that you can’t see coming is a cert rotation that will page you at 3am when something breaks. Every consolidation outcome from this series depends on the rotations happening smoothly forever, which means the operational discipline — alerts, dashboards, runbooks — matters as much as the Terraform modules. This article closes the series with the monitoring module, the four runbooks every platform team ends up writing, and the post-mortem template for when an incident happens anyway.

Original content from computingforgeeks.com - post 166206

Tested April 2026 on Google Cloud Monitoring, Certificate Manager (global + europe-west1), Terraform 1.9.8, Cloud Logging log-based metrics. Module published as infra-gcp/modules/cert-monitoring/.

What to Monitor

Five classes of signal matter for cert infrastructure:

  1. Cert expiry proximity. Expiration is the most common cert incident. Alert at 60, 30, and 7 days out. The 60-day alert gives room to investigate why auto-renewal failed; the 7-day alert is the last-chance escalation.
  2. Cert provisioning state changes. A cert moving from ACTIVE to PROVISIONING is either a rotation in progress (expected) or a validation failure (not expected). Alert on FAILED state and on PROVISIONING lasting longer than 30 minutes.
  3. DNS authorization health. Private CA issuance doesn’t depend on public DNS. Certificate Manager depends on it deeply: if the ACME CNAME disappears or the DS record breaks DNSSEC, the next renewal fails silently. Catch this before the cert expires.
  4. ACME challenge failures in logs. certificatemanager.googleapis.com logs every attempt. Build a log-based metric counting failures and alert on it.
  5. SSL policy drift. Someone re-enabling TLS 1.0/1.1 on a forwarding rule should never happen, but when it does, the fleet quietly loses compliance posture. Alert on any change to the SSL policy resource.
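Signal 1 is also easy to sanity-check locally against a PEM pulled from the LB, before (or alongside) the Cloud Monitoring wiring. A minimal sketch — the function name is illustrative, and it relies on GNU date for parsing:

```shell
#!/usr/bin/env bash
# days_until_expiry CERT.pem — print the integer number of days until
# the certificate's notAfter date (GNU date parses the openssl format).
days_until_expiry() {
  local end epoch_end epoch_now
  end=$(openssl x509 -in "$1" -noout -enddate | cut -d= -f2)
  epoch_end=$(date -d "$end" +%s)
  epoch_now=$(date +%s)
  echo $(( (epoch_end - epoch_now) / 86400 ))
}
```

Anything this prints under 30 should already have an open alert; if it doesn’t, the monitoring itself is the incident.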

The Monitoring Terraform Module

Each signal class becomes a google_monitoring_alert_policy with a documentation block linking to the runbook that handles it. Module at infra-gcp/modules/cert-monitoring/:

resource "google_monitoring_alert_policy" "cert_expiry_60d" {
  project      = var.project_id
  display_name = "Cert expiring in 60 days"
  combiner     = "OR"

  conditions {
    display_name = "Cert days-until-expiry < 60"
    condition_threshold {
      filter          = <<-EOT
        resource.type = "certificatemanager.googleapis.com/Certificate"
        AND metric.type = "certificatemanager.googleapis.com/certificate/expiry_seconds"
      EOT
      threshold_value = 60 * 24 * 60 * 60 # 60 days in seconds
      comparison      = "COMPARISON_LT"
      duration        = "300s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MIN"
      }
    }
  }

  notification_channels = var.notification_channels

  documentation {
    content   = "Runbook: ${var.runbook_url_base}/cert-rotation.md"
    mime_type = "text/markdown"
  }
}

Three copies of this resource cover the 60/30/7-day thresholds. The documentation.content field is pure markdown and shows up in the PagerDuty/email notification; link directly to the runbook so the on-call doesn’t have to hunt for it.
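Instead of hand-copying the resource three times, the thresholds can also be stamped out with for_each. A sketch under the same variables as above (not the literal published module):

```hcl
locals {
  expiry_thresholds_days = [60, 30, 7]
}

resource "google_monitoring_alert_policy" "cert_expiry" {
  # for_each needs string keys; convert the day counts.
  for_each = toset([for d in local.expiry_thresholds_days : tostring(d)])

  project      = var.project_id
  display_name = "Cert expiring in ${each.key} days"
  combiner     = "OR"

  conditions {
    display_name = "Cert days-until-expiry < ${each.key}"
    condition_threshold {
      filter          = <<-EOT
        resource.type = "certificatemanager.googleapis.com/Certificate"
        AND metric.type = "certificatemanager.googleapis.com/certificate/expiry_seconds"
      EOT
      threshold_value = tonumber(each.key) * 24 * 60 * 60
      comparison      = "COMPARISON_LT"
      duration        = "300s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_MIN"
      }
    }
  }

  notification_channels = var.notification_channels

  documentation {
    content   = "Runbook: ${var.runbook_url_base}/cert-rotation.md"
    mime_type = "text/markdown"
  }
}
```

One resource block, three policies, and adding a fourth threshold later is a one-line change to the local.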

Log-Based Metric for ACME Failures

Certificate Manager logs structured events under certificatemanager.googleapis.com. A log-based metric surfaces failure events as time-series data:

resource "google_logging_metric" "acme_challenge_failures" {
  project = var.project_id
  name    = "cert_manager/acme_challenge_failures"

  filter = <<-EOT
    resource.type = "audited_resource"
    AND protoPayload.serviceName = "certificatemanager.googleapis.com"
    AND protoPayload.response.managed.state = "FAILED"
  EOT

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}

Alert on the metric going above zero over any 5-minute window. The signal is rare (under normal conditions the count is zero forever), so any fire is real.
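Wiring that metric into an alert follows the same alert-policy shape. User-defined log-based metrics surface under the logging.googleapis.com/user/ prefix; the rest of this block is a sketch using the same variables as the module above:

```hcl
resource "google_monitoring_alert_policy" "acme_failures" {
  project      = var.project_id
  display_name = "ACME challenge failures"
  combiner     = "OR"

  conditions {
    display_name = "ACME failures > 0 in 5 minutes"
    condition_threshold {
      # Log-based metrics created as cert_manager/acme_challenge_failures
      # appear under the logging.googleapis.com/user/ namespace.
      filter          = <<-EOT
        resource.type = "audited_resource"
        AND metric.type = "logging.googleapis.com/user/cert_manager/acme_challenge_failures"
      EOT
      threshold_value = 0
      comparison      = "COMPARISON_GT"
      duration        = "0s"

      aggregations {
        alignment_period   = "300s"
        per_series_aligner = "ALIGN_SUM"
      }
    }
  }

  notification_channels = var.notification_channels
}
```

Because the steady-state value is zero, duration can stay at "0s": one failure event in a window is enough to page.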

Uptime Checks (End-to-End Proof)

Metrics + alerts catch API-level state changes. Uptime checks catch everything else — DNS propagation lag, network failures, browser-incompatible cert chains. Set up a regional uptime check per hostname, polling every minute from at least three geographic regions:

resource "google_monitoring_uptime_check_config" "hostname" {
  for_each = toset(var.monitored_hostnames)

  project      = var.project_id
  display_name = "HTTPS: ${each.key}"

  http_check {
    path           = "/"
    port           = 443
    use_ssl        = true
    validate_ssl   = true
    accepted_response_status_codes { status_class = "STATUS_CLASS_2XX" }
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      host       = each.key
      project_id = var.project_id
    }
  }

  period  = "60s"
  timeout = "10s"

  selected_regions = ["USA_OREGON", "EUROPE", "ASIA_PACIFIC"]
}

An uptime check fails when TLS breaks, even if the cert is still in ACTIVE state in the Certificate Manager API. That’s the check that catches “the cert exists but the LB got misconfigured” class of incidents.
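The uptime check doesn’t page on its own; it needs an alert policy on the check_passed metric. A sketch — the aligner/reducer pair below follows the pattern Google’s docs use for uptime alerting, and the threshold (“more than one region failing for 5 minutes”) is a tunable assumption:

```hcl
resource "google_monitoring_alert_policy" "uptime_failure" {
  project      = var.project_id
  display_name = "HTTPS uptime check failing"
  combiner     = "OR"

  conditions {
    display_name = "Uptime check failing in multiple regions"
    condition_threshold {
      filter          = <<-EOT
        resource.type = "uptime_url"
        AND metric.type = "monitoring.googleapis.com/uptime_check/check_passed"
      EOT
      comparison      = "COMPARISON_GT"
      threshold_value = 1
      duration        = "300s"

      aggregations {
        alignment_period     = "300s"
        per_series_aligner   = "ALIGN_NEXT_OLDER"
        # Counts the checkers reporting check_passed = false per host.
        cross_series_reducer = "REDUCE_COUNT_FALSE"
        group_by_fields      = ["resource.label.host"]
      }
    }
  }

  notification_channels = var.notification_channels
}
```

Requiring more than one failing region filters out single-prober flakes while still catching a real TLS breakage fast.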

The Dashboard

One dashboard per platform team, importable as JSON. Panels:

  • Cert expiry countdown per cert (sorted ascending; the expiring-soon ones float to the top)
  • Backend request rate per hostname, stacked so drops are visible
  • Backend latency p99 per hostname
  • Uptime check success rate per hostname per region
  • ACME challenge failure rate (stays at zero)
  • DS record presence (monitored via a synthetic scheduled job)

The dashboard JSON lives in the module at modules/cert-monitoring/dashboard.json. Import via gcloud monitoring dashboards create --config-from-file=dashboard.json.

The Audit Script

Alerts catch regressions on known resources. What catches unknown resources is a cross-project audit. scripts/audit-certs.sh lists every cert in every project the caller has access to:

#!/usr/bin/env bash
# audit-certs.sh — cross-project cert inventory

set -euo pipefail

for PROJECT in $(gcloud projects list --format="value(projectId)"); do
  echo "=== $PROJECT ==="
  gcloud compute ssl-certificates list --project="$PROJECT" \
    --format="table(name,type,managed.status)" 2>/dev/null || true
  gcloud certificate-manager certificates list --project="$PROJECT" \
    --format="table(name,managed.state,managed.domains)" 2>/dev/null || true
done

Run it quarterly. Any cert that shows up and can’t be traced back to the shared cert map or Private CA is drift: either onboard it onto the module from Article 9 or deliberately exempt it with documentation.
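The “traced back” step can be semi-automated by diffing the audited names against a known-good inventory. A sketch — the two input files are hypothetical, one cert name per line (e.g. exported from Terraform state and from the audit script’s output):

```shell
#!/usr/bin/env bash
# cert_drift EXPECTED.txt AUDITED.txt — print cert names present in the
# audit but absent from the expected inventory (i.e. drift).
cert_drift() {
  # comm -13 suppresses lines unique to file 1 and lines common to both,
  # leaving only lines unique to the audited file.
  comm -13 <(sort -u "$1") <(sort -u "$2")
}
```

Empty output means the fleet matches the inventory; anything printed goes straight onto the onboard-or-exempt list.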

Runbook 1: Cert Rotation (General Wildcard)

For Google-managed certs on the shared LB. Usually nothing to do — auto-renewal handles it. But when you get the 30-day alert and the cert is still not renewing:

  1. Check cert state: gcloud certificate-manager certificates describe <name>. If FAILED, skip to step 4.
  2. Check DNS authorization: gcloud certificate-manager dns-authorizations describe <name>. The CNAME it references should be present.
  3. Verify CNAME propagation: dig +short CNAME _acme-challenge.<domain> @1.1.1.1.
  4. If anything is broken, fix it (usually the DS record at the parent, or the CAA policy), then force a renewal attempt by recreating the cert resource: terragrunt apply -replace=google_certificate_manager_certificate.wildcard (the -replace flag supersedes the deprecated taint workflow).
  5. Expected result: cert moves PROVISIONING -> ACTIVE within 10 minutes. Close the alert.

Runbook 2: Cert Rotation (Private CA)

For the financial LB. Not auto-renewed. Manual on your rotation schedule (every 60-90 days):

  1. Decide: same-key rotation or new-key rotation. Default is same-key; new-key only if the previous key is due for retirement.
  2. Generate CSR if new key, reuse existing CSR if same key.
  3. Issue new cert: gcloud privateca certificates create ...
  4. Verify chain: openssl verify -CAfile root-ca.pem pay-new.pem
  5. If new-key rotation, verify client SPKI pins are updated across the fleet (see Runbook 3).
  6. Deploy the new cert to the LB via Terraform apply. Blue-green pattern from Article 4.
  7. Soak for 1 hour with traffic generator running.
  8. Retire old cert resource.
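Step 4 as written only checks the leaf against the root; when the Private CA hierarchy uses a subordinate, the intermediate has to be passed as untrusted or verification fails even on a good chain. A sketch wrapping both cases (intermediate.pem is a hypothetical filename):

```shell
#!/usr/bin/env bash
# verify_chain ROOT.pem LEAF.pem [INTERMEDIATE.pem]
# Exits 0 and prints "<leaf>: OK" iff the leaf chains to the root.
verify_chain() {
  if [ $# -ge 3 ]; then
    # Subordinate CA in the path: supply it as an untrusted intermediate.
    openssl verify -CAfile "$1" -untrusted "$3" "$2"
  else
    openssl verify -CAfile "$1" "$2"
  fi
}
# e.g. verify_chain root-ca.pem pay-new.pem intermediate.pem
```

Run it before deploying, not after: a chain that fails here will fail in every strict client once the cert hits the LB.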

Runbook 3: SPKI Pin Update

For new-key rotations on the financial API. Coordinated with mobile/client team:

  1. Generate next keypair and CSR. Compute SPKI pin on next key.
  2. Ship client update adding the next pin to the approved set (alongside the current pin).
  3. Verify fleet adoption via client telemetry. Threshold: 95% of active installs running the new build.
  4. Execute new-key rotation (Runbook 2) now that backup pin is in place.
  5. After soak, ship client update removing the old pin.
  6. Generate the next-next keypair; cycle begins again.
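The pin computation in step 1 is a three-command pipeline: DER-encode the public key (SPKI), SHA-256 it, base64 the digest — the RFC 7469 encoding most pinning libraries expect. A sketch, taking the private key as input:

```shell
#!/usr/bin/env bash
# spki_pin KEY.pem — print the base64(SHA-256(SPKI)) pin derived from a
# private key in PEM form (add -pubin to openssl pkey for a public key).
spki_pin() {
  openssl pkey -in "$1" -pubout -outform der 2>/dev/null \
    | openssl dgst -sha256 -binary \
    | openssl enc -base64
}
```

The output is a 44-character base64 string; that exact string is what ships in the client’s approved-pin set.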

Runbook 4: Emergency Cert Revocation

For compromise scenarios. Assumes the cert is actively issuing fraudulent traffic and has to stop now:

  1. Identify the cert: project, resource name, which LB(s) reference it.
  2. For Private CA certs: gcloud privateca certificates revoke --certificate=<cert-id>.
  3. For Certificate Manager certs: no direct revoke API. Remove the cert from every cert map entry referencing it: terraform apply with the cert map entry deleted.
  4. Update CAA records to block further issuance from the compromised CA (if the compromise is at the CA, not just one cert).
  5. Cut over traffic to a backup cert/LB if one exists. If not, accept the outage until a new cert is issued and deployed.
  6. Post-incident: full post-mortem, review access paths that let the compromise happen, consider rotating every cert issued from the same CA in the last 30 days as precautionary.

Post-Mortem Template

For any cert incident. Copy this markdown file and fill in:

# Cert Incident Post-Mortem — <date>

## Incident Summary
- When: <timestamp start> to <timestamp end>
- Impact: <services affected, user impact, revenue impact>
- Severity: <P0/P1/P2/P3>

## Timeline
- <minute-by-minute narrative>

## Root Cause
- <the specific thing that failed>
- <why the monitoring didn't catch it earlier>

## Fix Applied
- <what was done to resolve>

## Corrective Actions
- [ ] <action 1, with owner and due date>
- [ ] <action 2>

## Lessons Learned
- <what changes in the runbooks>
- <what changes in the monitoring>
- <what changes in the architecture>

Post-mortems that don’t produce runbook or monitoring updates are wasted effort. Every incident is either a gap in the runbook (didn’t handle this scenario) or a gap in monitoring (should have caught this earlier). Identify which, fix the gap, close the loop.

The Migration Playbook (Full)

For teams applying this consolidation pattern to an existing project with pre-existing cert sprawl. Seven phases, numbered 0 through 6:

  1. Phase 0: Audit. Run audit-certs.sh across every project. Catalog every cert. Identify ones that are safe to migrate and ones that need dedicated chains.
  2. Phase 1: DNS hardening. Enable DNSSEC on every zone. Add CAA records restricting issuance. Reversible, zero customer impact.
  3. Phase 2: Cert provisioning. Issue the shared wildcard via Certificate Manager. No traffic impact; the cert just sits there until attached.
  4. Phase 3: Pilot. Migrate one low-risk service (marketing site, documentation site) to the shared LB. Validate the pattern. Run for a week.
  5. Phase 4: Production migration. Migrate production services by domain cluster. One cluster per maintenance window. Blue-green LB swap per migration.
  6. Phase 5: Financial services separate track. Private CA provisioning, dedicated LB, SPKI pinning rollout on separate timeline from the general fleet.
  7. Phase 6: Cleanup. Destroy the per-service LBs and ManagedCertificates. This is the irreversible step; leave it for last.

Rollback at every phase except 6 is a Terraform revert. Phase 6 rollback is “reprovision the old sprawl pattern from scratch,” which nobody wants to do, which is why Phase 6 only happens when every preceding phase has soaked cleanly for at least 30 days.

Tying It All Back

This series started with a cost calculation: 30 services × 4 environments × $25/month = sprawl that bleeds roughly $3k/month before a single packet of real traffic moves. It ends with a durable pattern:

  • One shared LB for the general fleet, one cert map, one wildcard, zero per-service rotation work
  • One dedicated LB for financial services with Private CA + SPKI pinning and proper isolation
  • One cert inventory module + CI guardrails so the pattern doesn’t regress
  • One dashboard + five alert policies + four runbooks so rotations are boring
  • One demonstrably zero-incident rotation workflow, proved in the capstone

Every outcome is visible, tested, and captured in code or runbook form. The consolidation holds not because anyone is watching, but because the default path is the consolidated path and CI fails on deviation. That’s the step-change that makes certs stop being a quarterly fire drill.

Cleanup

The monitoring module, like the service-onboarding module, is durable infrastructure. Leave it in place. The audit script lives in the demo repo. The runbooks live in the platform-docs repo (wherever your team keeps them; they do NOT live in private task-trackers because on-call needs to find them fast).

When the lab itself tears down at series end: ./scripts/session-down.sh. Everything provisioned across the 11 articles of this series goes with it, except the Cloudflare NS delegation which gets removed manually.

The series ends here. The demo repo at github.com/cfg-labs/gcp-shared-traffic-demo stays public and versioned per article; the modules in infra-gcp/modules/ stay maintained. If Google Cloud ships a new Gateway API class, a new Certificate Manager feature, or a new CA Service tier, the series gets amended — not rewritten.
