
Designing ACME Validation for Outage-Prone Architectures: DNS-01 Strategies and Pitfalls


2026-01-22


Design resilient DNS-01 automation to survive Cloudflare/AWS outages: delegation, acme-dns, multi-provider failover, and CI/CD patterns for reliable renewals.

Stop chasing renewals during provider outages

Outages at major providers like Cloudflare, AWS, and the social platforms that feed traffic spikes have a nasty side effect: failed ACME renewals. If your automation relies on HTTP-01 checks that terminate at a CDN or single DNS/edge provider, a provider outage can cause expiry and downtime. This guide contrasts HTTP-01 and DNS-01 in the context of large-provider outages (late 2025 and January 2026 incidents), then walks you through practical, production-ready DNS-01 automation patterns, failover architectures, CI/CD integration, and troubleshooting.

Why this matters in 2026

Late-2025 and early-2026 outages affecting Cloudflare, AWS, and other major networks have reminded teams that edge dependence increases blast radius for both application traffic and certificate validation. In 2026, ACME automation is mainstream and teams increasingly adopt DNS-01 for reliability, wildcard certificates, and multi-tenant setups. At the same time, DNS APIs, provider SLAs, and DNS delegation features have matured — enabling robust failover patterns if you design for it.

High-level comparison: HTTP-01 vs DNS-01 during provider outages

HTTP-01 — pros and single-point-of-failure cons

Pros: Simple. Works with standard web servers and CDNs. Automated by most ACME clients out of the box.
Cons: ACME servers must reach your web endpoint. If your CDN or origin is down, probes fail. Edge proxying (Cloudflare’s orange-cloud) can obscure challenges or fail during provider outages.
Real-world failure: during the Jan 2026 outage waves, many teams using HTTP-01 with edge providers saw renewals fail because Let's Encrypt validation requests couldn't reach the expected resources.

DNS-01 — resilience and operational trade-offs

Pros: Decouples validation from web traffic — certificate issuance only requires writing a TXT record to DNS. Wildcards require DNS-01. With API access or delegation, DNS-01 can succeed even if the site or CDN is down.
Cons: Requires DNS API access or control of authoritative name servers. DNS propagation and TTL management add complexity. Misconfigured delegation or DNSSEC can break validation.

Core strategies for resilient DNS-01 automation

Below are battle-tested architectures you can apply depending on your constraints.

1) Multi-provider authoritative DNS for _acme-challenge (delegation model)

Delegate the _acme-challenge.example.com zone to a provider you control for TXT records. This isolates ACME TXT management from your primary DNS provider.

Create a secondary DNS account (Provider B) capable of quick API updates.
In your primary DNS (Provider A), create NS records for _acme-challenge.example.com that point to Provider B's name servers.
Use your automation (CI/CD, cert-manager, or scripts) to update TXT records via Provider B's API.

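Benefits: even if Provider A's API, control plane, or edge proxy is degraded, renewals keep working as long as its authoritative name servers still answer the delegation; the CA's resolver follows the NS records and finds the TXT values at Provider B. A complete outage of Provider A's authoritative DNS would still break validation, because the delegation itself lives in the parent zone. This is a common pattern used by SRE teams to separate certificate validation from primary DNS change processes.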
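
A minimal sketch of how the delegation could be verified from an automation host, assuming dig is installed. The name server and record names are placeholders, and txt_present and delegated_txt_ok are hypothetical helper names, not part of any tool.

```shell
#!/bin/sh
# Sketch: confirm that a challenge value is served by a delegated name server.
# All host names below are placeholders.

# Succeeds when the expected value appears in `dig +short TXT` output on stdin.
# dig prints each TXT string wrapped in double quotes, one record per line.
txt_present() {
  grep -qxF "\"$1\""
}

# Query one authoritative server directly, bypassing recursive caches.
# Usage: delegated_txt_ok NAMESERVER RECORD_NAME EXPECTED_VALUE
delegated_txt_ok() {
  dig +short +norecurse TXT "$2" @"$1" | txt_present "$3"
}

# Example (hypothetical names):
#   dig +short NS _acme-challenge.example.com   # should list Provider B's servers
#   delegated_txt_ok ns1.provider-b.example _acme-challenge.example.com "$TOKEN"
```

Matching the whole quoted line avoids false positives when one challenge value happens to be a prefix of another.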

2) Redundant DNS providers with synchronized records

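Maintain identical TXT records across two providers simultaneously using orchestration tools (octoDNS, Terraform, or custom scripts). On renewal, write the TXT to both providers: when both appear in the zone's NS set, the CA may query either one, so a record that exists at only one provider can fail validation intermittently. If one write fails, retry it and alert rather than proceeding.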

Use Terraform or octoDNS to keep authoritative state in Git.
Write idempotent scripts that create TXT on provider A and then provider B.
Monitor API error rates and propagate alerts to on-call if both fail — tie this into observability and runbook automation.
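
The dual-write step above can be sketched as a small wrapper that runs every provider writer and fails if any of them failed. This is an illustration only: write_all is a hypothetical helper, and the writer functions it calls (one per provider) are assumed to exist and to be safe to run twice.

```shell
#!/bin/sh
# Run every provider writer and report failure if any of them failed.
# Each argument is the name of a command or shell function that creates
# the TXT record at one provider and is idempotent.
write_all() {
  failed=0
  for writer in "$@"; do
    if "$writer"; then
      echo "ok: $writer"
    else
      echo "failed: $writer" >&2
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Example (hypothetical writer functions):
#   write_all add_txt_provider_a add_txt_provider_b || alert_on_call
```

Running all writers even after one fails keeps the providers as close to in sync as possible and gives the alert a complete picture of which APIs are degraded.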

3) acme-dns / dedicated DNS microservice

Run an acme-dns server (https://github.com/joohoi/acme-dns) and give it a small zone of its own. Your ACME client registers with acme-dns once, stores the returned credentials, and updates TXT entries through the acme-dns HTTP API. This model reduces external provider dependency: you own the service that answers ACME queries.

Delegate a dedicated zone (for example auth.example.com) to your acme-dns server(s) with NS records, then CNAME _acme-challenge.example.com to the per-account subdomain that acme-dns returns at registration.
Expose the authenticated update API only to the hosts that run renewals; integrate it with your certificate automation tools.
Run acme-dns in multiple regions to survive datacenter outages; consider networking and edge failover design.
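
As a rough sketch of the update call: acme-dns accepts a POST to /update with X-Api-User and X-Api-Key headers and a JSON body carrying the subdomain and a 43-character TXT value. The ACMEDNS_* variable names and the helper function names below are assumptions made for illustration; the values themselves come from the /register response.

```shell
#!/bin/sh
# Succeeds when the value looks like a DNS-01 token: 43 base64url characters.
valid_challenge() {
  case "$1" in
    *[!A-Za-z0-9_-]*) return 1 ;;
  esac
  [ "${#1}" -eq 43 ]
}

# Print the JSON body for the acme-dns /update endpoint.
# Usage: update_body SUBDOMAIN TXT_VALUE
update_body() {
  printf '{"subdomain":"%s","txt":"%s"}' "$1" "$2"
}

# POST the update. ACMEDNS_URL, ACMEDNS_USER, ACMEDNS_KEY, and
# ACMEDNS_SUBDOMAIN are hypothetical variable names for the registration data.
acmedns_update() {
  valid_challenge "$1" || return 1
  curl --fail --silent --show-error --max-time 20 -X POST \
    -H "X-Api-User: $ACMEDNS_USER" -H "X-Api-Key: $ACMEDNS_KEY" \
    --data "$(update_body "$ACMEDNS_SUBDOMAIN" "$1")" \
    "$ACMEDNS_URL/update"
}
```

Validating the value locally before the request catches truncated or re-encoded tokens early, which is cheaper than a failed validation at the CA.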

4) Dynamic provider failover via CI/CD (API-first approach)

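When delegation is not possible, implement a failover routine in the automation pipeline: try the Provider A API, fall back to Provider B on error, then trigger ACME validation. The fallback only helps if Provider B is also authoritative for the record, that is, its name servers are in the zone's NS set.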

Operational rule: Always verify propagation (via authoritative queries) before asking the ACME CA to validate.
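
One way to apply this rule is a bounded polling loop around whatever propagation check you use. The sketch below is generic: wait_until is a hypothetical helper, and the check command passed to it is assumed to return 0 once the record is visible on the authoritative servers.

```shell
#!/bin/sh
# Re-run a check command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example (hypothetical check function):
#   wait_until 12 10 txt_visible_on_all_nameservers || exit 1
```

A fixed attempt budget keeps a stuck provider from hanging the pipeline and turns slow propagation into an explicit failure that can be alerted on.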

Step-by-step: a resilient DNS-01 renewal workflow

Below is a practical workflow combining best practices. It assumes you have API tokens for Cloudflare (primary) and Route 53 (secondary), and that you use an ACME client that supports hook scripts (certbot, acme.sh, or lego).

Prerequisites

API tokens with least privilege stored in a secrets manager (GitHub Secrets, Vault, AWS Secrets Manager).
CI runner or automation host with network reach to both provider APIs.
An ACME client that supports DNS hook scripts (acme.sh, certbot with dns plugins, lego).

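Automation script (sketch)

Key logic: attempt the primary, verify with authoritative dig queries, fall back to the secondary, and exit 0 so the ACME client proceeds with validation. The zone-ID variable names are assumptions; adapt them to your secrets setup.

#!/usr/bin/env bash
# dns-01-failover.sh: simplified certbot --manual-auth-hook with provider failover.
# Certbot exports CERTBOT_DOMAIN and CERTBOT_VALIDATION to the hook.
# Assumed environment: CF_API_TOKEN, CF_ZONE_ID, R53_ZONE_ID, and AWS credentials.
# Requires curl, dig, and the AWS CLI.
PRIMARY=cloudflare
SECONDARY=route53
DOMAIN="${CERTBOT_DOMAIN:?CERTBOT_DOMAIN is not set}"
TXT_NAME="_acme-challenge.$DOMAIN"
TXT_VALUE="${CERTBOT_VALIDATION:?CERTBOT_VALIDATION is not set}"
TTL=60

# Create the TXT record through the Cloudflare API; returns 0 on success.
add_primary() {
  curl --fail --silent --show-error --max-time 20 -X POST \
    "https://api.cloudflare.com/client/v4/zones/$CF_ZONE_ID/dns_records" \
    -H "Authorization: Bearer $CF_API_TOKEN" \
    -H "Content-Type: application/json" \
    --data "{\"type\":\"TXT\",\"name\":\"$TXT_NAME\",\"content\":\"$TXT_VALUE\",\"ttl\":$TTL}" \
    >/dev/null
}

# Upsert the TXT record in Route 53. UPSERT replaces the whole record set, so
# extend this if several challenge values must coexist under one name.
add_secondary() {
  aws route53 change-resource-record-sets --hosted-zone-id "$R53_ZONE_ID" \
    --change-batch "{\"Changes\":[{\"Action\":\"UPSERT\",\"ResourceRecordSet\":{\"Name\":\"$TXT_NAME\",\"Type\":\"TXT\",\"TTL\":$TTL,\"ResourceRecords\":[{\"Value\":\"\\\"$TXT_VALUE\\\"\"}]}}]}" \
    >/dev/null
}

# Poll the authoritative name servers until each one returns TXT_VALUE.
# Uses the delegated servers for TXT_NAME if present, else those of DOMAIN.
check_propagation() {
  local servers ns ok attempt
  servers=$(dig +short NS "$TXT_NAME")
  [ -n "$servers" ] || servers=$(dig +short NS "$DOMAIN")
  [ -n "$servers" ] || return 1
  for attempt in 1 2 3 4 5 6; do
    ok=1
    for ns in $servers; do
      dig +short TXT "$TXT_NAME" @"$ns" | grep -qF -e "$TXT_VALUE" || ok=0
    done
    [ "$ok" -eq 1 ] && return 0
    sleep 10
  done
  return 1
}

# try primary
if add_primary && check_propagation; then
  exit 0
fi

# primary failed or did not propagate, try secondary
# (only useful when the secondary is authoritative for TXT_NAME)
echo "$PRIMARY failed, trying $SECONDARY" >&2
if add_secondary && check_propagation; then
  exit 0
fi

# if we reach here, both failed
exit 1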

Integrating with Certbot (example)

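Certbot supports --manual hooks, but provider plugins are the better choice when they cover your setup. If you need custom failover, use --manual with the script above as --manual-auth-hook and a matching --manual-cleanup-hook. Certbot passes the domain and challenge value to hooks through the CERTBOT_DOMAIN and CERTBOT_VALIDATION environment variables, and it runs the auth hook once per requested name, so for example.com plus *.example.com both TXT values must coexist at _acme-challenge.example.com.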

certbot certonly --manual \
  --preferred-challenges dns \
  --manual-auth-hook /opt/certs/dns-01-failover.sh \
  --manual-cleanup-hook /opt/certs/dns-01-cleanup.sh \
  -d example.com -d '*.example.com'

When using certbot DNS plugins (dns-cloudflare, dns-route53), you can still orchestrate failover by attempting one plugin then the other in your pipeline.
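
The cleanup hook referenced in the command above is not shown in this guide. A minimal sketch for the Route 53 side might look like the following, assuming the AWS CLI is installed, that R53_ZONE_ID is a variable you define, and that the record was created with a 60-second TTL; r53_change_batch and cleanup_secondary are hypothetical helper names. Route 53 only deletes a record set when the name, type, TTL, and values match exactly. Cloudflare cleanup is omitted here because it needs an extra lookup of the record ID before the DELETE call.

```shell
#!/bin/sh
# Print a Route 53 change batch for a single TXT record.
# Usage: r53_change_batch ACTION NAME VALUE TTL
# TXT values must be wrapped in escaped double quotes inside the JSON.
r53_change_batch() {
  printf '{"Changes":[{"Action":"%s","ResourceRecordSet":{"Name":"%s","Type":"TXT","TTL":%s,"ResourceRecords":[{"Value":"\\"%s\\""}]}}]}' \
    "$1" "$2" "$4" "$3"
}

# Delete the challenge record written by the auth hook.
# CERTBOT_DOMAIN and CERTBOT_VALIDATION are exported by certbot.
cleanup_secondary() {
  aws route53 change-resource-record-sets --hosted-zone-id "$R53_ZONE_ID" \
    --change-batch "$(r53_change_batch DELETE \
      "_acme-challenge.$CERTBOT_DOMAIN" "$CERTBOT_VALIDATION" 60)"
}
```

Removing stale challenge records keeps the TXT record set small and avoids confusing later propagation checks with leftover values.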

CI/CD integration patterns (GitHub Actions example)

Automate renewals in CI to centralize API credentials and provide auditing.

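name: renew-cert
on:
  schedule:
    - cron: '0 2 * * *'  # run nightly
jobs:
  renew:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Install acme.sh
        run: curl https://get.acme.sh | sh

      - name: Issue with failover
        env:
          CF_Token: ${{ secrets.CF_TOKEN }}        # variable name read by acme.sh dns_cf
          CF_Zone_ID: ${{ secrets.CF_ZONE_ID }}    # assumed secret name
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_KEY }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET }}
        run: |
          # Hosted runners start clean: restore ~/.acme.sh from secure storage first,
          # or every run requests a new certificate. This sketch uses the staging CA.
          # The dns_aws fallback only works if Route 53 is authoritative for the name.
          ~/.acme.sh/acme.sh --issue --server letsencrypt_test --dns dns_cf -d example.com \
            || ~/.acme.sh/acme.sh --issue --server letsencrypt_test --dns dns_aws -d example.com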

Store API tokens in the Git provider’s secret store and rotate them periodically. Tie CI logs into your observability platform to spot API degradation early and feed alerts into your runbooks and playbooks (see documentation tooling for runbook templates).

Architectural patterns & trade-offs

Delegate only the ACME subdomain

Why: Minimal blast radius and quick TXT updates without touching main DNS. You can give a third party or internal microservice limited NS control.

Full multi-authoritative approach

Synchronize entire zone across providers. Harder to maintain (and more expensive) but automates failover for everything.

acme-dns microservice

Low operational complexity after initial setup. You own the lifecycle and can horizontally scale acme-dns across regions. Watch for DNSSEC implications and ensure monitoring on the acme-dns instances.

Operational best practices

Test renewals regularly: schedule test certs (staging ACME) and run full validation drills. Treat certificate renewal like on-call runbook practice.
Verify authoritative propagation: query authoritative NS directly (dig @ns1.providerB _acme-challenge.example.com TXT) rather than relying on recursive caches.
Short TTL for ACME records: set low TTLs for TXT records used for validation, but be cautious — very low TTLs can cause higher query load on providers and might hit rate limits.
Least privilege API tokens: grant only DNS edit permissions for the specific zone. Use secrets manager and rotate tokens on schedule.
Alert and escalate: if a primary and secondary provider both fail or propagation lags, alert on-call before expiry windows shrink below your renewal margin.
Handle rate limits: Let’s Encrypt and DNS providers have rate limits. Use the staging environment for tests and spread renewal operations to avoid bursts.
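
The alerting rule above needs a concrete notion of "renewal margin". One possible check, sketched here with openssl x509 -checkend, flags a certificate that expires within a given number of days. The function name needs_renewal and the 30-day figure in the example are illustrative assumptions.

```shell
#!/bin/sh
# Succeeds (returns 0) when the certificate expires within MARGIN_DAYS.
# Usage: needs_renewal CERT_PEM MARGIN_DAYS
# `openssl x509 -checkend N` exits 0 only if the certificate is still valid
# N seconds from now, so the result is inverted here. An unreadable file
# also counts as "needs renewal", which fails safe.
needs_renewal() {
  seconds=$(( $2 * 86400 ))
  ! openssl x509 -checkend "$seconds" -noout -in "$1" >/dev/null 2>&1
}

# Example (hypothetical path and alert command):
#   needs_renewal /etc/ssl/example.com/fullchain.pem 30 && page_on_call
```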

Troubleshooting checklist

Validate API call success: inspect provider API responses and logs.
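Check authoritative answers: use dig +nssearch example.com to list the authoritative servers, then dig @<nameserver> _acme-challenge.example.com TXT to confirm the TXT record is present.
DNSSEC: when delegating from a signed zone, either sign the delegated subdomain and publish its DS record in the parent, or leave it as an insecure delegation with no DS record. A DS record that does not match the child's keys makes validating resolvers return SERVFAIL.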
Content mismatch: confirm ACME client and TXT writing service use identical values (no base64 line breaks or whitespace issues).
Propagation delays: increase polling interval and total timeout for slow providers.
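
For content-mismatch debugging it helps to recompute the expected value yourself. Under RFC 8555, the DNS-01 TXT value is the unpadded base64url encoding of the SHA-256 digest of the key authorization. The sketch below assumes the openssl CLI is available; dns01_txt_value and normalize_txt are hypothetical helper names.

```shell
#!/bin/sh
# Print the TXT value for a key authorization:
# base64url(SHA-256(input)) with the trailing padding removed.
dns01_txt_value() {
  printf '%s' "$1" | openssl dgst -sha256 -binary | openssl base64 -A \
    | tr '+/' '-_' | tr -d '='
}

# Strip the quotes and whitespace that dig adds around TXT strings,
# so served and expected values can be compared byte for byte.
normalize_txt() {
  printf '%s' "$1" | tr -d '" \t\r\n'
}

# Example (hypothetical names):
#   served=$(dig +short TXT _acme-challenge.example.com @ns1.provider-b.example)
#   [ "$(normalize_txt "$served")" = "$(dns01_txt_value "$KEY_AUTH")" ]
```

The comparison in the example only works when a single TXT value is published under the name; with several values, compare line by line.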

Case study: surviving a Cloudflare outage (anonymized)

One fintech team used Cloudflare as primary DNS and edge. During a late-2025 outage they lost HTTP-01 renewals and the Cloudflare API was partially degraded — creating TXT records failed intermittently. They implemented a two-step fix:

Delegated _acme-challenge to Route 53 and ran a lightweight acme-dns proxy; renewals immediately succeeded during Cloudflare edge instability.
Added an automated failover script that attempted Cloudflare first and Route 53 second, plus staged alerts if both failed.

Outcome: certificates continued to renew through outages, and the team eliminated manual interventions during two subsequent provider incidents in early 2026.

2026 trends and future-proofing

Increased adoption of DNS-01 for multi-tenant platforms and automated wildcard TLS for ephemeral workloads.
More mature DNS provider APIs and standardization (GraphQL and stable REST endpoints) are simplifying automation, but architectural resilience (delegation and multi-provider) remains necessary.
Expectation of more frequent edge outages means teams must decouple validation from the web path — DNS-01 and acme-dns patterns will be standard practice for high-uptime platforms.

Security and compliance notes

Use least-privilege tokens, secure storage (HashiCorp Vault or cloud provider secrets), and audit all API activity. If your environment requires logging of certificate changes for compliance, centralize records in your CI/CD pipeline and retain issuance metadata for audits.

Final checklist before you go live

Have at least two routes to write TXT records (delegation or dual providers).
Store provider credentials securely and rotate regularly.
Run staged renewals against Let’s Encrypt staging before production pushes.
Implement authoritative-propagation checks in your automation.
Document runbooks and schedule renewal drills at least quarterly. Use documentation tooling like Compose.page to version and publish runbooks.

Key takeaways

DNS-01 decouples certificate validation from web traffic and is the resilient choice for outage-prone architectures.
Use delegation, acme-dns, or multi-provider strategies to mitigate provider outages.
Integrate DNS-01 automation into CI/CD with least-privilege secrets, propagation checks, and fallback logic.
Test, monitor, and practice renewals — automation fails silently if you do not validate it under failure conditions.

Call to action

Start by running a staged DNS-01 renewal today: delegate _acme-challenge to a secondary provider or spin up acme-dns, then integrate a failover-aware script into your CI pipeline. If you want a ready-made checklist or a reviewed pipeline template for your stack (Cloudflare, Route 53, Kubernetes cert-manager, or GitHub Actions), request our automation playbook and runbook review — we'll help you eliminate certificate-related downtime.
