
Hardening Containerized ACME Renewals in Kubernetes During Cloud Provider Outages


2026-02-03


Practical patterns to keep Kubernetes cert-manager DNS-01 renewals working during cloud outages: HA cert-manager, acme-dns, multi-provider DNS, and multi-cluster secret replication.

When cloud DNS or CDNs go dark, your certs shouldn't

Outages at Cloudflare, AWS, and other major providers in late 2025 and early 2026 taught platform teams a hard lesson: TLS renewal is only as resilient as the DNS and control plane it depends on. If cert-manager can't complete a DNS-01 challenge because the DNS provider is offline, certificates can expire and cause cascading service failures.

Why this matters now (2026): threat landscape and trends

In 2026, teams increasingly deploy ephemeral workloads across multi-cloud and multi-cluster topologies. Adoption of ACME-based automation (Let’s Encrypt and others) is near-ubiquitous for production workloads. That combination—short-lived certs (90 days) plus distributed platforms—means renewal automation must resist outages, not just assume continuous upstream availability.

Recent outages (Jan 2026 reports showing spikes across X, Cloudflare, and AWS) highlighted two trends:

Single-provider DNS or single-region cert management creates single points of failure for TLS.
Automation operators often bind cert issuance tightly to a single control plane (one cert-manager instance, one DNS provider), so a provider outage blocks renewals.

What you'll learn

This guide shows pragmatic, actionable patterns to keep Kubernetes TLS renewals working during cloud provider outages. You’ll get:

Durable cert-manager deployment and leader election practices for high availability
DNS-01 multi-provider and multi-cluster fallback patterns, including acme-dns and TXT replication strategies
Secrets replication, cross-cluster certificate cache strategies, and safe renewBefore tuning
Monitoring, runbooks, and automation tactics to survive partial outages

1) Harden cert-manager itself

Run it highly available across zones and control planes

Make cert-manager a multi-replica, anti-affinitized, PDB-backed deployment. That reduces impact from node/zone failures and rolling updates.

Key settings:

replicas: 3 or 5 depending on cluster size (only the elected leader reconciles; the other replicas are hot standbys)
Pod anti-affinity to spread replicas across zones
PodDisruptionBudget so node drains never evict every replica at once (leader election needs one live candidate, not a quorum)
Always enable leader election (cert-manager implements it with Kubernetes Lease objects)

Example: deployment flags and PDB

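# Partial snippets to illustrate intent; merge them into your own manifests.
# Controller container args (leader election is enabled by default).
containers:
- name: cert-manager
  args:
    - --v=2
    - --leader-elect=true
    - --leader-election-namespace=kube-system
---
# Keep at least two controller replicas running during voluntary disruptions.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: cert-manager-pdb
  namespace: cert-manager  # assumes the default install namespace
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: cert-manager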

Tune logging and probe settings. cert-manager exposes Prometheus metrics (certificate expiry, readiness, and ACME client request counts, among others) that you should scrape.
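The replica, PDB, and anti-affinity settings interact, and it is easy to end up with a budget that blocks every drain or with all replicas in one zone. The following is a minimal sketch of a sanity check you could run in CI; the function names and the zone list input are assumptions made for illustration, not part of cert-manager.

```python
from typing import List


def drainable_pods(replicas: int, min_available: int) -> int:
    """Pods a voluntary disruption (such as a node drain) may evict at once."""
    return max(replicas - min_available, 0)


def check_ha(replicas: int, min_available: int, pod_zones: List[str]) -> List[str]:
    """Return human-readable problems with the HA settings, or [] if none."""
    problems = []
    if drainable_pods(replicas, min_available) == 0:
        # minAvailable >= replicas means the PDB never allows an eviction.
        problems.append("PDB blocks all voluntary evictions; node drains will hang")
    if len(set(pod_zones)) < 2:
        # Anti-affinity is not effective if every replica landed in one zone.
        problems.append("all replicas run in a single zone")
    return problems


if __name__ == "__main__":
    print(check_ha(3, 2, ["zone-a", "zone-b", "zone-c"]))
    print(check_ha(2, 2, ["zone-a", "zone-a"]))
```

With three replicas and minAvailable: 2, one pod can be evicted at a time, which matches the PDB shown above.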

Make RBAC and API access durable

Leader election depends on API availability. Ensure the control plane is highly available (multiple API server replicas) and that RBAC grants cert-manager access to Lease objects in the leader-election namespace (older releases also used ConfigMaps). If you run multiple clusters, consider a dedicated control-plane cluster for certificate orchestration (see multi-cluster section).

2) DNS-01 multi-provider strategies

DNS is the Achilles’ heel for DNS-01 validation. If your DNS provider is unavailable, ACME cannot verify ownership. There are multiple patterns to mitigate this risk—pick one (or combine):

Pattern A — Dual-hosted zone (dual DNS providers)

Host your zone with two authoritative providers (e.g., Cloudflare + Route 53). If Provider A has an outage, Provider B still serves the zone and TXT changes can propagate. This requires:

Providers that support primary/secondary zone replication, or automated synchronization tools (octoDNS, dnscontrol).
Careful NS management (some TLDs limit NS delegation). For subdomains you control, delegate the _acme-challenge subdomain across multiple providers.

Pros: simple for large zones. Cons: replication lag and complexity when CRUD operations fail in one provider.
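Replication lag between the two providers is the main operational risk here, so it helps to detect drift explicitly. Below is a minimal sketch that compares two zone snapshots; the snapshot shape (a dict keyed by record name and type) is an assumption, and fetching the records from each provider is left to your sync tooling.

```python
from typing import Dict, FrozenSet, List, Tuple

RecordKey = Tuple[str, str]              # (record name, record type)
Zone = Dict[RecordKey, FrozenSet[str]]   # record key -> set of values


def zone_drift(primary: Zone, secondary: Zone) -> Dict[str, List[RecordKey]]:
    """Report records that differ between two snapshots of the same zone."""
    return {
        "missing_in_secondary": sorted(k for k in primary if k not in secondary),
        "extra_in_secondary": sorted(k for k in secondary if k not in primary),
        "changed": sorted(
            k for k in primary if k in secondary and primary[k] != secondary[k]
        ),
    }


if __name__ == "__main__":
    provider_a = {
        ("example.com", "A"): frozenset({"192.0.2.1"}),
        ("_acme-challenge.example.com", "TXT"): frozenset({"abc"}),
    }
    provider_b = {("example.com", "A"): frozenset({"192.0.2.1"})}
    print(zone_drift(provider_a, provider_b))
```

A non-empty "missing_in_secondary" entry for an _acme-challenge name during a renewal is the case to alert on, because a validator that queries the lagging provider will not find the TXT record.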

Pattern B — CNAME to acme-dns (recommended for many teams)

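Use acme-dns (a small DNS server with an HTTP API that serves only ACME challenge TXT records) and point the _acme-challenge.example.com CNAME at your acme-dns registration subdomain. cert-manager writes the TXT value through the acme-dns API; your authoritative DNS serves only the CNAME. To survive provider outages, run acme-dns instances in multiple regions or providers on top of a replicated database. acme-dns ships with SQLite and PostgreSQL backends, so PostgreSQL with multi-region replication is the straightforward choice; other stores such as DynamoDB or Consul would need a custom fork or wrapper.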

Architecture highlights:

Authoritative DNS holds a stable CNAME; it rarely changes, so it’s less fragile during upgrades
ACME verification hits the acme-dns target to read TXT entries; if one acme-dns instance is down, a replicated endpoint still serves the TXT
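To make the data flow concrete, here is a hedged sketch of what gets published. The TXT value is the unpadded base64url SHA-256 digest of the key authorization (RFC 8555), and acme-dns accepts it on its /update endpoint with X-Api-User and X-Api-Key headers. cert-manager's solver does this for you; the sketch is only useful for synthetic tests or debugging. The endpoint URLs and credentials are placeholders, and the code builds requests without sending them.

```python
import base64
import hashlib
import json
from typing import List
from urllib.request import Request


def dns01_txt_value(key_authorization: str) -> str:
    """Unpadded base64url SHA-256 digest of the ACME key authorization."""
    digest = hashlib.sha256(key_authorization.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


def build_update_requests(
    endpoints: List[str], subdomain: str, username: str, password: str, txt: str
) -> List[Request]:
    """One acme-dns /update request per endpoint, so a caller can fail over."""
    body = json.dumps({"subdomain": subdomain, "txt": txt}).encode("utf-8")
    headers = {
        "X-Api-User": username,
        "X-Api-Key": password,
        "Content-Type": "application/json",
    }
    return [
        Request(url=base.rstrip("/") + "/update", data=body, headers=headers, method="POST")
        for base in endpoints
    ]


if __name__ == "__main__":
    txt = dns01_txt_value("example-token.example-thumbprint")
    # Hypothetical endpoints; with a shared replicated backend one write suffices.
    for req in build_update_requests(
        ["https://acme-dns-a.example.net", "https://acme-dns-b.example.net"],
        "registration-subdomain", "api-user", "api-key", txt,
    ):
        print(req.get_method(), req.full_url)
```

With a replicated database behind the instances, a successful write to any one endpoint is enough; the second request exists only as a fallback when the first instance is unreachable.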

Pattern C — TXT replication across providers

When cert-manager creates the TXT record against Provider A, trigger a replication job that writes the same TXT to Provider B. Implement via:

An admission webhook or controller watching cert-manager's Challenge resources, which carry the TXT value to publish
Integration with octoDNS or external-dns (external-dns can manage TXT in some setups) to write to multiple providers

This pattern provides immediate fallback if Provider A degrades after the record is written, because Provider B already has the TXT. The challenge is keeping replication latency within the ACME validation window; before validation proceeds, query each provider's authoritative nameservers and confirm they return the same TXT value.
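A replication job along these lines can be sketched as a fan-out that records per-provider results instead of failing on the first error. The writer callables below stand in for real provider SDK calls and are assumptions made for illustration.

```python
from typing import Callable, Dict

# (fqdn, txt_value) -> None; a writer raises an exception on failure.
Writer = Callable[[str, str], None]


def replicate_txt(fqdn: str, value: str, providers: Dict[str, Writer]) -> Dict[str, str]:
    """Write the same TXT record to every provider and collect the outcomes."""
    results = {}
    for name, write in providers.items():
        try:
            write(fqdn, value)
            results[name] = "ok"
        except Exception as exc:  # keep going so one outage does not block the rest
            results[name] = "failed: {}".format(exc)
    return results


def served_somewhere(results: Dict[str, str]) -> bool:
    """At least one provider holds the record, so fallback is possible."""
    return any(status == "ok" for status in results.values())


def fully_consistent(results: Dict[str, str]) -> bool:
    """Every provider holds the record, so any nameserver gives the same answer."""
    return bool(results) and all(status == "ok" for status in results.values())


if __name__ == "__main__":
    def provider_down(fqdn: str, value: str) -> None:
        raise TimeoutError("api timeout")

    def provider_up(fqdn: str, value: str) -> None:
        print("wrote", fqdn)

    print(replicate_txt("_acme-challenge.example.com", "abc",
                        {"provider-a": provider_down, "provider-b": provider_up}))
```

Treat "served somewhere but not fully consistent" as a degraded state: validation can still succeed if the failed provider is also not answering queries, but it is worth an alert.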

Example: cert-manager ClusterIssuer with CNAME-to-acme-dns

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-account-key
    solvers:
    - dns01:
        # Follow the _acme-challenge CNAME to the acme-dns zone
        cnameStrategy: Follow
        acmeDNS:
          host: https://acme-dns.example.net
          # Secret holding the acme-dns registration JSON
          # (name and key below are placeholders)
          accountSecretRef:
            name: acme-dns
            key: acmedns.json

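Note: cert-manager ships built-in DNS-01 solvers for several providers (including acme-dns, Route 53, and Cloudflare) and supports others through webhook solvers. The acme-dns pattern reduces the number of provider APIs cert-manager must call directly during a renewal.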

3) Multi-cluster failover patterns

When one cluster loses egress or the DNS APIs in a region are unavailable, other clusters can still renew certificates if you design the system intentionally.

Pattern: leader cluster + passive clusters with shared certificate cache

Designate one cluster (or a central control plane) as the certificate issuance manager. It runs cert-manager and performs renewals. After issuance, the certificate secrets are replicated to all clusters. During a leader cluster outage, a passive cluster can be promoted to leader or serve TLS from its cached certificate until renewal is possible.

Replication targets: HashiCorp Vault, S3 with KMS, or GitOps secrets operator (sealed-secrets or SOPS) for secure distribution
Keep an audit trail and strict RBAC on who can promote clusters

Example workflow to replicate cert secrets to multiple clusters

cert-manager issues certificate in control cluster as Kubernetes Secret
A controller exports the secret to Vault or S3 (encrypted) and writes metadata (expiration)
Passive clusters run a sync job that pulls the secret and writes it as a local Secret in the same namespace as the ingress controller
If the control cluster is down during renewal windows, passive clusters use cached certs until a manual promotion or automated failover triggers an issuance job when DNS is available
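The sync job in the passive cluster needs a rule for when a pulled certificate should overwrite the local cache. A minimal sketch of that decision follows; the CertMeta type and its fields are hypothetical metadata the exporter would write next to the secret, and reading from or writing to the store is out of scope.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class CertMeta:
    fingerprint: str      # for example, the SHA-256 of the certificate
    not_after: datetime   # expiry recorded by the exporter


def should_replace(local: Optional[CertMeta], remote: CertMeta, now: datetime) -> bool:
    """Decide whether the pulled certificate should overwrite the cached one."""
    if remote.not_after <= now:
        return False  # never install a certificate that has already expired
    if local is None:
        return True   # nothing cached yet
    if remote.fingerprint == local.fingerprint:
        return False  # same certificate, skip the write
    return remote.not_after > local.not_after  # only move forward in time


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    cached = CertMeta("aaa", now + timedelta(days=20))
    pulled = CertMeta("bbb", now + timedelta(days=80))
    print(should_replace(cached, pulled, now))
```

Refusing to move backwards in expiry protects the cache from a stale object in the store, for example after a restore from backup.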

When to promote automatically vs. manually

Automatic promotion is risky if DNS is flaky. Don't blindly create new CertificateRequests against a provider that is partially failing, because you may hit rate limits. A safer pattern is automatic promotion behind an approval gate when time-to-expiry drops below X days (a configurable alert threshold), or manual promotion when an outage exceeds a duration threshold.
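That policy can be written down as a small decision function so that it is reviewable and testable. The sketch below is one possible encoding; the threshold defaults (14 days, 6 hours) and the action names are assumptions, not recommendations from any tool.

```python
def promotion_action(
    days_to_expiry: float,
    outage_hours: float,
    dns_healthy: bool,
    expiry_gate_days: float = 14.0,       # hypothetical value for "X days"
    outage_threshold_hours: float = 6.0,  # hypothetical outage threshold
) -> str:
    """Return what a passive cluster should do while the leader cluster is down."""
    if not dns_healthy:
        # Issuing now would fail and burn ACME rate limit; keep serving the cache.
        return "serve-cached"
    if days_to_expiry < expiry_gate_days:
        # Close to expiry: automation prepares the promotion, a human approves it.
        return "promote-with-approval"
    if outage_hours > outage_threshold_hours:
        # Plenty of validity left, but the outage is long enough to page an operator.
        return "manual-promotion"
    return "wait"


if __name__ == "__main__":
    print(promotion_action(days_to_expiry=10, outage_hours=1, dns_healthy=True))
    print(promotion_action(days_to_expiry=40, outage_hours=8, dns_healthy=True))
    print(promotion_action(days_to_expiry=5, outage_hours=8, dns_healthy=False))
```

Checking DNS health first is deliberate: it keeps a nearly expired certificate from triggering issuance attempts that cannot succeed.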

4) Tuning renewals for resilience (safe renewBefore)

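cert-manager's Certificate resource has a renewBefore field. By default, cert-manager renews once two-thirds of the certificate's lifetime has elapsed, which is 30 days before expiry for a 90-day Let's Encrypt certificate. To build resilience: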

Increase renewBefore to 45–60 days for critical domains. This gives a wider window to retry when DNS or providers are intermittent.
Beware of abuse: extremely early renewals multiply requests and may hit ACME rate limits. Track ACME rate limits (Let’s Encrypt) and use staging for test certs. See guidance on reconciling vendor SLAs and outage windows for renewal planning: From Outage to SLA.

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-cert
spec:
  dnsNames:
  - example.com
  secretName: example-cert-tls
  issuerRef:
    name: letsencrypt-dns  # the ClusterIssuer defined earlier
    kind: ClusterIssuer
  renewBefore: 720h # 30 days = 720h; increase to 1440h for 60 days

Adjust as part of a broader resilience plan—if you host many domains, test the aggregated rate against Let's Encrypt policies.
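The trade-off between a wider retry window and more issuance traffic is simple arithmetic. The helpers below are a sketch for working it out; the function names are illustrative, and the default rule encoded in default_renew_before is cert-manager's documented behavior of renewing with one-third of the lifetime remaining.

```python
from datetime import timedelta


def default_renew_before(duration: timedelta) -> timedelta:
    """cert-manager's default: renew when one-third of the lifetime remains."""
    return duration / 3


def renew_before_field(days: int) -> str:
    """Format a number of days as an hours value for the renewBefore field."""
    return "{}h".format(days * 24)


def renewals_per_year(duration_days: int, renew_before_days: int) -> int:
    """Full renewal cycles per domain per year at steady state."""
    return 365 // (duration_days - renew_before_days)


if __name__ == "__main__":
    print(default_renew_before(timedelta(days=90)))  # default window for a 90-day cert
    print(renew_before_field(45), renew_before_field(60))
    # Doubling the window from 30 to 60 days doubles issuance volume.
    print(renewals_per_year(90, 30), renewals_per_year(90, 60))
```

For a 90-day certificate, moving renewBefore from 30 to 60 days takes each domain from 6 to 12 renewals per year, which is the number to multiply by your domain count when checking rate limits.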

5) Observability, alerting, and runbooks

What to monitor

Certificate age and days-to-expiry (alert at 21/14/7/3/1 days)
Certificate readiness and ACME client errors in cert-manager (metrics such as certmanager_certificate_ready_status and certmanager_http_acme_client_request_count, broken down by status code)
ACME challenge failures by solver and provider
DNS provider API errors and latency (instrument provider SDK or use synthetic tests)

Sample alerts

Alert: Certificate expires in < 14 days and last renewal attempt failed > 3 times → P1
Alert: cert-manager ACME challenge errors > 5/min for domain suffix → investigate DNS provider
Alert: cert-manager leader lease transferred or leader restarts frequently → investigate control plane stability
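The expiry tiers and the P1 rule above translate directly into code, which is handy for unit-testing alert logic before encoding it in your alerting system. This is a sketch; it assumes you already have the expiry timestamp (for example from certmanager_certificate_expiration_timestamp_seconds) and a count of failed renewal attempts from your own tracking.

```python
from typing import Optional

THRESHOLDS_DAYS = (21, 14, 7, 3, 1)  # alert tiers from the monitoring list


def days_left(expiration_ts: float, now_ts: float) -> float:
    """Days until expiry, given Unix timestamps in seconds."""
    return (expiration_ts - now_ts) / 86400.0


def expiry_tier(days: float) -> Optional[int]:
    """Tightest threshold that has been crossed, or None if none has."""
    crossed = [t for t in THRESHOLDS_DAYS if days <= t]
    return min(crossed) if crossed else None


def is_p1(days: float, failed_attempts: int) -> bool:
    """P1 rule: under 14 days left and more than 3 failed renewal attempts."""
    return days < 14 and failed_attempts > 3


if __name__ == "__main__":
    remaining = days_left(expiration_ts=1_000_000 + 10 * 86400, now_ts=1_000_000)
    print(remaining, expiry_tier(remaining), is_p1(remaining, failed_attempts=4))
```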

Runbook snippets: DNS outage during renewal

Check cert-manager metrics and CertificateRequest events for error codes (DNS timeout, 5xx, rate limit)
Check DNS provider status pages (Cloudflare/AWS) and your provider incident feeds
If provider degraded: confirm TXT replication status to secondary provider or confirm acme-dns replication health
If no replication exists: export current certificate from control cluster and distribute to edge clusters to cover until DNS is healthy
Record the incident and adjust renewBefore or add replication as a remediation

6) Avoid rate limits and validation pitfalls

During outages teams sometimes retry aggressively and hit ACME rate limits. Mitigate by:

Exponential backoff for failed CertificateRequests
Staging environment usage for testing changes
Centralized issuance for high-volume domains, so renewals don't run in parallel and uncontrolled
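cert-manager applies its own backoff between failed issuance attempts, but any custom orchestration or retry script you add around it needs the same discipline. Below is a minimal sketch of a capped exponential backoff schedule with optional jitter; the base, cap, and jitter values are illustrative assumptions.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base: float = 60.0,    # first delay in seconds
    cap: float = 3600.0,   # never wait longer than this between attempts
    jitter: float = 0.0,   # fraction of the delay, e.g. 0.2 for +/-20%
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Delays in seconds before each retry: base * 2^n, capped, with jitter."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * 2 ** attempt)
        if jitter:
            # Spread retries so many clusters do not hit the CA at the same moment.
            delay *= 1 + rng.uniform(-jitter, jitter)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(backoff_delays(8))
    print(backoff_delays(5, jitter=0.2))
```

Jitter matters most after a shared outage ends, when every cluster would otherwise retry on the same schedule.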

7) Real-world resiliency patterns and case studies

Below are concise, experience-driven patterns that worked for platform teams during the 2025–2026 outage waves.

Case: Media company (multi-region edge cache)

Problem: Region A’s DNS vendor experienced a control-plane outage at renewal time. Their platform used cert-manager in each edge cluster, all relying on the same DNS API.

Fix implemented:

Migrated to CNAME → acme-dns with acme-dns backend replicated to DynamoDB with cross-region reads
Set renewBefore to 45 days for their high-traffic domains
Implemented secrets sync from a central cluster to edge clusters via S3 with KMS encryption

Outcome: During subsequent outages renewals continued because the TXT records were served by acme-dns replicas, and the cached certificates avoided consumer impact.

Case: Enterprise SaaS (multi-cloud control plane)

Problem: One cloud provider’s IAM changes blocked cert-manager from updating Route 53 entries mid-renewal window.

Fix implemented:

Created two ClusterIssuers—one per DNS provider—and an orchestrator that chooses the provider based on health checks
On provider outage, orchestrator promotes the fallback ClusterIssuer only when TTL and expiry metrics cross thresholds

Outcome: No certificate expirations in the next 12 months. The orchestrator also logged failed attempts for post-mortem.
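The selection logic of such an orchestrator can be sketched in a few lines. This is not the team's implementation; it is an illustration under assumed names (the issuer names, the three-check health rule, and the 14-day gate are all hypothetical).

```python
from typing import Dict, List, Optional


def healthy(history: List[bool], required_successes: int = 3) -> bool:
    """A provider counts as healthy only if its last N checks all passed."""
    recent = history[-required_successes:]
    return len(recent) == required_successes and all(recent)


def choose_issuer(
    primary: str,
    fallback: str,
    health: Dict[str, List[bool]],
    days_to_expiry: float,
    fallback_gate_days: float = 14.0,
) -> Optional[str]:
    """Pick the ClusterIssuer to use, or None to hold off and retry later."""
    if healthy(health.get(primary, [])):
        return primary
    # Only fail over when expiry is close enough to justify switching providers.
    if days_to_expiry <= fallback_gate_days and healthy(health.get(fallback, [])):
        return fallback
    return None


if __name__ == "__main__":
    checks = {
        "letsencrypt-route53": [True, False, False],
        "letsencrypt-cloudflare": [True, True, True],
    }
    print(choose_issuer("letsencrypt-route53", "letsencrypt-cloudflare", checks, 10))
    print(choose_issuer("letsencrypt-route53", "letsencrypt-cloudflare", checks, 40))
```

Requiring several consecutive successful checks avoids flapping between issuers while a provider is only partially recovered.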

8) Implementation checklist (practical and actionable)

Run cert-manager with 3+ replicas, anti-affinity, and PDBs. Verify leader election role.
Pick a DNS-01 multi-provider strategy (dual-hosted, acme-dns CNAME, or TXT replication). Prototype in staging.
Implement secrets replication (Vault or encrypted S3) and test cross-cluster secret restore path.
Tune renewBefore for critical domains (45–60 days) and document ACME rate limits you might hit.
Instrument cert-manager metrics and create alerts for CertificateRequest failures and low days-to-expiry.
Build a runbook for DNS provider outage with scripted secret export/import and manual promotion steps.
Run fire-drills twice a year: simulate DNS provider outage and practice manual promotion and cert recovery.

Troubleshooting notes and gotchas

Rate limits: If you see HTTP 429 from ACME, pause automatic retries and switch to a staging issuer for tests.
DNS TTLs: Low TTL helps propagation but increases DNS API churn. Set a balanced TTL for challenge subdomains.
CNAME chains: Some resolvers limit CNAME depth; keep CNAME chains short for acme-dns patterns.
Clock skew: ACME validation is sensitive—ensure NTP sync on cert-manager nodes and control planes.
Secret leakage: When replicating secrets, use encryption-at-rest and strict IAM roles. Treat private keys as high-value secrets.

Future-proofing: what to watch in 2026 and beyond

In 2026 platform teams should watch two important trends:

Increased multi-provider DNS tooling. Tools for dual-hosted zones and DNS replication matured in 2024–2025; expect managed-secondary DNS offerings to become mainstream in 2026.
Edge CA and short-lived cert orchestration. More teams will rely on centralized cert services that act as internal CAs for short-lived mTLS and TLS certs; ensure your automation can interoperate with ACME and internal CA APIs.

Final recommendations: think availability, not just automation

Automation alone is not resilience. Treat certificate issuance and DNS as a distributed system with redundancy, monitoring, and explicit failover policies. The tactical steps above reduce blast radius when a cloud provider has a partial outage—the goal is to prevent TLS from becoming a single point of failure.

Actionable takeaways

Make cert-manager HA: replicas, anti-affinity, PDB, and leader election.
Remove single-provider DNS dependence: use CNAME → acme-dns, dual-hosted zones, or TXT replication.
Replicate issued secrets to passive clusters via encrypted central store.
Tune renewBefore to widen renewal windows for critical domains, but monitor ACME rate limits.
Monitor and write runbooks: alert early and practice outage drills.

Call to action

If you manage Kubernetes TLS at scale, start by running a renewal fire-drill this quarter: simulate a DNS provider outage, exercise your failover path (acme-dns or secondary provider), and verify certificate continuity across clusters. If you’d like, download our checklist and sample Helm charts to implement the patterns in this article—deploy them first in staging, iterate, then roll to production.

Durable cert automation is a system design problem. Reducing provider blast radius, adding replication for TXT records, and operationalizing failover are what move you from “automated” to “resilient.”
