
Emergency TLS Response: What to Do When a Major CDN or Cloud Goes Down

2026-02-05

Runbook-style emergency TLS actions for domain admins: DNS-01 issuance, rerouting traffic, and recovering from CDN/cloud outages in 2026.

Emergency TLS Response: Runbook for domain admins when a major CDN or cloud provider goes down

When X, Cloudflare, or AWS suffers a regional outage, your users notice the downtime, and TLS is often the first layer to break. This runbook gives domain admins and DevOps teams step-by-step actions to recover encrypted traffic quickly, reissue certificates using DNS‑01 when the CDN or cloud control plane is unavailable, and reroute traffic without creating new certificate failures.

Why this matters in 2026

Late 2025 and early 2026 saw a spike in large CDN and cloud control-plane incidents that amplified two realities: teams rely heavily on managed TLS termination and DNS vendors, and single-provider failures cascade into certificate outages. The industry response has been to embrace multi-DNS and pre-delegation patterns, DNS‑01 emergency issuance workflows, and automated failover logic. This runbook distills those tactics into repeatable playbooks you can execute during an incident.

Incident overview — detect, scope, and decide

Immediate detection

Alert sources: SRE pager, RTT and synthetic checks, certificate monitoring alerts (expiry or OCSP failures).
Quick verification: try an external TLS handshake and DNS queries from different networks.

# quick TLS probe (stdin from /dev/null so s_client exits after the handshake)
openssl s_client -connect example.com:443 -servername example.com -brief < /dev/null

# check OCSP stapling
openssl s_client -connect example.com:443 -servername example.com -status < /dev/null

# DNS TXT for ACME
dig +short TXT _acme-challenge.example.com @8.8.8.8

# layered check: curl from external vantage
curl -I https://example.com --resolve example.com:443:203.0.113.5 -v

Scope the outage

Which layer is failing: CDN edge, CDN control plane (DNS/portal/API), DNS provider, or origin cloud? Use traceroute, dig, and CDN status pages to tell them apart.
Are issues global or regional? Use public probing services (e.g., RIPE Atlas, reserved probes) and multiple DNS resolvers.
Check certificate symptoms: expired, absent, handshake errors, or successful handshake but broken backend connections.

Decision matrix (fast)

If the CDN control plane or DNS provider is down and you rely on its API for DNS-01 challenges, activate emergency DNS issuance via preconfigured fallback providers or TXT CNAME delegation.
If the CDN is down but DNS is fine, consider temporarily bypassing the CDN by switching A/AAAA/ALIAS records (with low TTLs) to origin IPs or to an alternative CDN.
If certificates are expired or will expire during the outage, prioritize emergency DNS‑01 issuance and short-lived certs so TLS is restored before you reroute traffic.
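The matrix above can be captured in a small shell function so everyone on the bridge gets the same recommendation from the same inputs. This is only a sketch: the function name `decide_action`, its three arguments, and the action labels it prints are all hypothetical and not part of any tool.

```shell
# Hypothetical helper encoding the decision matrix; all names are illustrative.
# Usage: decide_action DNS_API CDN CERT
#   DNS_API: ok|down   CDN: ok|down   CERT: ok|expiring
decide_action() {
  da_dns_api=$1
  da_cdn=$2
  da_cert=$3
  if [ "$da_dns_api" = "down" ]; then
    # Primary DNS API is unusable: issuance has to go through the fallback path.
    echo "issue-via-fallback-dns-or-cname-delegation"
  elif [ "$da_cert" = "expiring" ]; then
    # DNS works but the cert is expired or about to expire: fix TLS before rerouting.
    echo "issue-dns01-then-reroute"
  elif [ "$da_cdn" = "down" ]; then
    # Cert is fine and DNS works: bypass the CDN with low-TTL record changes.
    echo "bypass-cdn-via-dns"
  else
    echo "no-action"
  fi
}
```

For example, `decide_action ok down ok` prints `bypass-cdn-via-dns`. Adjust the ordering of the branches if your own priorities differ.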

Pre-incident preparation (what you should do now)

These actions are most valuable when implemented before an incident. Treat them like choreographed drills.

Centralize ACME credentials and multiple DNS API keys in a secrets store (Vault, AWS Secrets Manager, etc.). Have at least two DNS providers configured for DNS‑01 automation.
Pre-delegate ACME challenge records using CNAME delegation: create a stable CNAME _acme-challenge.example.com → acme-01.example-acme-ext.net. Manage TXT records at the delegated zone for emergency issuance.
Keep origin TLS certs ready: public certs (Let's Encrypt wildcard via DNS‑01) or an internal PKI hot spare. Do not rely solely on CDN-issued origin certificates that only the CDN trusts.
Set low TTLs for critical A/AAAA and ALIAS records (60–300s) when you expect rapid failover windows, and longer TTLs for static assets.
Automate health checks in your DNS provider for automatic failover, and run playbook tests monthly. Feed regional observability signals into those checks so you spot regional degradation early.
Document: one-page runbook and pre-staged scripts to add/remove TXT records, issue certs with acme clients, and update DNS quickly. Use an incident response template to keep your cheat-sheet consistent across teams.
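One preparation item that is easy to verify continuously is the CNAME delegation itself. The sketch below assumes `dig` is installed and uses the example delegation target from the list above; the function name `check_acme_delegation` and the `DIG` override variable (handy for dry runs) are hypothetical.

```shell
# Hypothetical audit helper: confirm _acme-challenge.DOMAIN is a CNAME to the
# expected delegated name. Set DIG to substitute another lookup command.
# Usage: check_acme_delegation DOMAIN EXPECTED_TARGET
check_acme_delegation() {
  cad_actual=$(${DIG:-dig} +short CNAME "_acme-challenge.$1" | head -n 1)
  # dig prints absolute names with a trailing dot; strip it on both sides.
  cad_actual=${cad_actual%.}
  cad_expected=${2%.}
  if [ "$cad_actual" = "$cad_expected" ]; then
    echo "OK $1 -> $cad_actual"
  else
    echo "MISSING $1 (got '${cad_actual:-none}', want '$cad_expected')"
    return 1
  fi
}
```

Running `check_acme_delegation example.com acme-01.example-acme-ext.net` from a scheduled job is one way to catch a delegation that was removed by accident. The comparison is case-sensitive, which is a simplification.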

Runbook: Step-by-step recovery

Step 0 — Assemble the incident team

Operators, DNS owners, security lead, and on-call devs.
Open a shared incident channel (chat/bridge) and a real-time log stream (tail webserver logs and ACME client logs).

Step 1 — Verify certificate failure mode

Use openssl and curl to classify the error — expired, no cert, wrong SAN, revoked, or handshake cipher mismatch.
Example commands:

# expired / not-present (stdin from /dev/null so the command returns immediately)
openssl s_client -connect example.com:443 -servername example.com < /dev/null

# check certificate details
echo | openssl s_client -connect example.com:443 -servername example.com 2>/dev/null | openssl x509 -noout -text | sed -n '1,120p'
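To turn that output into a quick label for the incident channel, you can key off the `Verify return code` line that `openssl s_client` prints. The following is a rough sketch: the function name `classify_verify` and the labels are hypothetical, and the mapping covers only a handful of common OpenSSL verify codes.

```shell
# Hypothetical classifier: reads `openssl s_client` output on stdin and prints
# a coarse failure class based on the first "Verify return code" line.
classify_verify() {
  cv_out=$(cat)
  case "$cv_out" in
    *"no peer certificate available"*) echo "no-certificate"; return 0 ;;
  esac
  cv_code=$(printf '%s\n' "$cv_out" \
    | sed -n 's/.*Verify return code: \([0-9][0-9]*\).*/\1/p' | head -n 1)
  case "$cv_code" in
    0)     echo "ok" ;;
    10)    echo "expired" ;;
    18|19) echo "self-signed" ;;
    20|21) echo "incomplete-chain-or-untrusted-issuer" ;;
    23)    echo "revoked" ;;
    62)    echo "hostname-mismatch" ;;  # only reported when -verify_hostname is used
    "")    echo "no-verify-result" ;;
    *)     echo "other-verify-error-$cv_code" ;;
  esac
}
```

A typical call would be `openssl s_client -connect example.com:443 -servername example.com -verify_hostname example.com < /dev/null 2>/dev/null | classify_verify`. Treat the label as a hint and read the full output before acting on it.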

Step 2 — Emergency DNS‑01 issuance (fast path)

When your primary DNS provider/API is unavailable, but you've pre-configured a fallback DNS provider or CNAME delegation, you can issue new certificates using DNS‑01. The goal is to get a trusted certificate on the origin so clients talking directly to the origin (bypassing the CDN) see a valid cert.

Option A — Use fallback DNS provider with API

Switch your ACME client to use the fallback DNS plugin and the stored API key in your secrets manager.
Issue wildcard/public cert with your ACME client (acme.sh, Certbot with DNS plugin, lego, or cert-manager in Kubernetes).

# example with acme.sh and Cloudflare plugin (replace dns provider plugin as needed)
export CF_Token="${VAULT_CF_TOKEN}"
acme.sh --issue --dns dns_cf -d example.com -d "*.example.com"
acme.sh --install-cert -d example.com \
  --key-file /etc/ssl/private/example.key \
  --fullchain-file /etc/ssl/certs/example.pem
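If you have more than one DNS provider configured, the switch to the fallback can be scripted rather than done by hand. The sketch below tries a list of acme.sh DNS hooks in order and stops at the first that succeeds. The function name `issue_with_fallback` and the `ACME_SH` override variable are hypothetical, and it assumes each hook's credentials are already exported in the environment.

```shell
# Hypothetical wrapper: try acme.sh DNS hooks in order until one issues.
# Prints the hook that worked on stdout; progress goes to stderr.
# Usage: issue_with_fallback DOMAIN HOOK [HOOK...]
issue_with_fallback() {
  iwf_domain=$1
  shift
  for iwf_hook in "$@"; do
    echo "trying DNS hook: $iwf_hook" >&2
    if ${ACME_SH:-acme.sh} --issue --dns "$iwf_hook" \
         -d "$iwf_domain" -d "*.$iwf_domain"; then
      echo "$iwf_hook"
      return 0
    fi
  done
  echo "all DNS hooks failed for $iwf_domain" >&2
  return 1
}
```

For instance, `issue_with_fallback example.com dns_cf dns_aws` would attempt Cloudflare first and Route 53 second. Note that a hook only helps if the zone, or the delegated `_acme-challenge` zone, is actually hosted at that provider.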

Option B — Use CNAME delegation for TXT records (recommended pre-configured)

Many teams pre-create a CNAME so the ACME TXT record points to a zone you control on an independent provider. During an outage, publish TXT under that delegated zone and run ACME locally.

# verify delegated CNAME
dig +short CNAME _acme-challenge.example.com

# then ensure TXT exists at the CNAME target in the delegated zone
# (the target below is the example name from the preparation section)
dig +short TXT acme-01.example-acme-ext.net

After adding TXT to the delegated zone, run the ACME client using --manual or with the delegated provider plugin.
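Before letting the ACME client proceed, it helps to wait until the TXT value is actually visible, because a premature validation attempt counts as a failed validation. Here is a minimal polling sketch, assuming `dig` is available. The function name `wait_for_txt`, its default of 30 tries at 10-second intervals, and the `DIG` override variable are hypothetical choices.

```shell
# Hypothetical helper: poll until NAME returns a TXT record equal to EXPECTED.
# Usage: wait_for_txt NAME EXPECTED [TRIES] [SLEEP_SECONDS]
wait_for_txt() {
  wft_name=$1
  wft_expected=$2
  wft_tries=${3:-30}
  wft_pause=${4:-10}
  wft_i=0
  while [ "$wft_i" -lt "$wft_tries" ]; do
    # dig +short wraps TXT data in double quotes; strip them before comparing.
    if ${DIG:-dig} +short TXT "$wft_name" | tr -d '"' \
         | grep -qxF -- "$wft_expected"; then
      return 0
    fi
    wft_i=$((wft_i + 1))
    sleep "$wft_pause"
  done
  return 1
}
```

A call such as `wait_for_txt _acme-challenge.example.com "$TOKEN_VALUE" && echo ready` follows the CNAME through your resolver, so it checks the same path the CA will use. `TOKEN_VALUE` stands for whatever value your ACME client asked you to publish.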

Option C — Manual DNS with Certbot (if APIs are unavailable)

# Certbot manual DNS flow
certbot certonly --manual --preferred-challenges dns \
  -d example.com -d "*.example.com" --agree-tos

# follow prompts to place TXT under _acme-challenge.example.com
# (apex + wildcard produce two TXT values; both must be present at the same time)

Step 3 — Update origin server TLS

Once you have a valid certificate for the hostname(s), install it on your origin. If the origin previously relied on CDN for TLS, ensure your web server is configured to present the new cert when clients connect directly.

# example nginx snippet
server {
  listen 443 ssl;
  server_name example.com;
  ssl_certificate /etc/ssl/certs/example.pem;
  ssl_certificate_key /etc/ssl/private/example.key;
  ssl_protocols TLSv1.2 TLSv1.3;
}

Step 4 — Redirect traffic: bypass CDN safely

Now that the origin can serve a valid cert, reroute traffic. Choose the least disruptive method:

DNS A/AAAA switch: Point example.com to origin IP(s) or the failover CDN. Use pre-set low TTLs to minimize propagation delay.
ALIAS/ANAME: If your DNS provider supports ALIAS records, point the record at the origin load balancer's hostname. Beware of providers whose control plane is affected by the outage.
Temporary secondary CDN: If you have a secondary CDN, switch CNAME/ANAME to the secondary CDN endpoint and ensure origin accepts TLS (SNI) from that CDN.

Example: change A record using a scripted provider API (pseudo):

# use provider CLI to set A record to origin
provider-cli dns update --zone example.com --name example.com --type A --value 203.0.113.55 --ttl 60
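As one concrete variant of the pseudo-command above, the sketch below builds the change batch that the AWS CLI expects for a Route 53 record update. It assumes your fallback zone lives in Route 53, which this runbook does not prescribe. The function name `r53_upsert_a` and the `ZONE_ID` variable in the usage note are hypothetical.

```shell
# Hypothetical helper: print a Route 53 change batch that upserts one A record.
# Usage: r53_upsert_a NAME IPV4 TTL
r53_upsert_a() {
  printf '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"%s","Type":"A","TTL":%s,"ResourceRecords":[{"Value":"%s"}]}}]}\n' \
    "$1" "$3" "$2"
}
```

It could then be applied with `aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" --change-batch "$(r53_upsert_a example.com. 203.0.113.55 60)"`. Printing the JSON first and reviewing it in the incident channel is a cheap safeguard against pointing the record at the wrong IP.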

Step 5 — Validate end-to-end

From several outside vantage points, perform TLS handshakes and HTTP requests.
Check certificate chain, SANs, OCSP stapling, and HSTS headers.
Monitor logs for handshake errors and 5xx responses; feed key signals into your SRE dashboards and post-incident metrics collectors.

# TLS and HTTP verification
openssl s_client -connect example.com:443 -servername example.com -status < /dev/null
curl -I https://example.com --resolve example.com:443:203.0.113.55 -v
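When you repeat the HTTP check from several vantage points, a one-line summary per probe is easier to compare than full verbose output. The sketch below reads response headers on stdin and reports the status code and whether an HSTS header is present. The function name `summarize_headers` and its output format are hypothetical.

```shell
# Hypothetical helper: summarize `curl -sI` output read from stdin.
summarize_headers() {
  smh_in=$(tr -d '\r')   # HTTP headers use CRLF line endings; drop the CRs
  smh_status=$(printf '%s\n' "$smh_in" | awk 'NR==1 {print $2}')
  if printf '%s\n' "$smh_in" | grep -qi '^strict-transport-security:'; then
    smh_hsts=yes
  else
    smh_hsts=no
  fi
  echo "status=$smh_status hsts=$smh_hsts"
}
```

For example: `curl -sI https://example.com --resolve example.com:443:203.0.113.55 | summarize_headers`. It only inspects the first response, so avoid combining it with redirect following.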

Troubleshooting common failure modes

ACME challenge failed

Symptoms: ACME client reports challenge timeout or TXT not found.
Fixes: confirm TXT exists at authoritative servers; check TTLs and propagation using multiple resolvers; look for CNAME chain issues or wildcard conflicts.
Commands: dig +trace, dig @NAMESERVER (querying each authoritative server directly), and your DNS provider's change history.
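Comparing what several resolvers return for the challenge name makes propagation gaps obvious at a glance. This is a small sketch assuming `dig` is available; the function name `txt_by_resolver` and the `DIG` override variable are hypothetical.

```shell
# Hypothetical helper: print the TXT values each resolver returns for NAME,
# one tab-separated line per resolver.
# Usage: txt_by_resolver NAME RESOLVER [RESOLVER...]
txt_by_resolver() {
  tbr_name=$1
  shift
  for tbr_ns in "$@"; do
    # Strip the quotes dig adds and join multiple values with commas.
    tbr_val=$(${DIG:-dig} +short TXT "$tbr_name" "@$tbr_ns" \
      | tr -d '"' | paste -sd, -)
    printf '%s\t%s\n' "$tbr_ns" "${tbr_val:-(none)}"
  done
}
```

Running `txt_by_resolver _acme-challenge.example.com 8.8.8.8 1.1.1.1 ns1.fallback-dns.example` puts public resolvers and an authoritative server side by side. A value that appears only at the authoritative server usually points to caching of an older answer.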

Rate limits reached

Let's Encrypt enforces rate limits (new-cert, duplicate-cert, failed validations). In an emergency, prefer a single wildcard or multi-SAN cert over many single-name certs to reduce the number of issuance requests.
If you hit a rate limit, keep serving an existing certificate that is still valid, or consider buying a short-term certificate from a commercial CA if the service is business-critical.

Handshake failures after reroute

Check SNI mismatch: client-sent SNI must match the cert SAN.
Confirm server presents the newly installed cert (openssl s_client -servername ...).
If the origin depended on request rewrites or headers added by the CDN, ensure it accepts traffic that arrives without them.

OCSP/Stapling issues

Symptoms: browsers warn about revocation or stapling failure.
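Fixes: ensure your web server fetches OCSP responses and staples them. If stapling fails because of egress blocks, allow outbound requests to the CA's OCSP responders. Let's Encrypt ended its OCSP service in 2025 and its certificates no longer carry an OCSP URL, so this applies only to certificates from CAs that still run OCSP responders.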

Post-incident actions and hardening

Document exact timeline, decisions, and what worked.
Rotate API keys used during the incident if they were exposed in chat/history.
Implement more resilient DNS architecture: multi-DNS providers, pre-delegated ACME CNAMEs, automated failover health checks, and synthetic ACME tests.
Integrate certificate monitoring into on-call tooling (alerts for expiry, CT log anomalies, OCSP failures).
Run tabletop drills quarterly to practice emergency DNS‑01 issuance and reroute flows; pair drills with an incident response template to capture decisions quickly.

Advanced strategies & 2026 trends

Expect these operational shifts to be standard by the end of 2026:

ACME brokers and delegated issuance: centralized teams run ACME brokers that issue certs across clouds. They multiplex through multiple DNS providers to avoid single-point failures; governance and decision planes for these flows are covered in edge auditability & decision planes.
Pre-delegated TXT/CNAME patterns for rapid DNS‑01 issuance became a best practice after the late‑2025 outages, because they allow issuing certs even when the primary DNS API is down.
Certificate orchestration as code: Terraform modules and GitOps flows to manage DNS delegations, credentials, and ACME clients reproducibly — this automation intersects with serverless and edge microhub patterns in serverless data mesh thinking.
Edge-less TLS readiness: more teams keep a publicly trusted origin certificate active so the origin can handle TLS if the edge is offline. Consider hosting origin-capable artifacts on pocket edge hosts or other small edge platforms for resilience.

Pro tip: A running origin with a valid public certificate is the fastest way to restore HTTPS access — treat that cert as your emergency lifeline.

Quick reference checklist (copy this into your incident channel)

Verify failure mode: expired / missing / handshake error.
If primary DNS/API unavailable — use fallback DNS or CNAME-delegation path.
Issue DNS‑01 cert (acme.sh / certbot / cert-manager) and install on origin.
Switch DNS A/AAAA/ALIAS records to origin or secondary CDN (use low TTLs).
Validate via openssl and external curl; monitor logs.
Postmortem, rotate keys, drill improvements.

Appendix: Useful commands and cert-manager tips

cert-manager (Kubernetes) emergency flow

If cert-manager is configured with a DNS provider that's down, you can create a temporary ClusterIssuer pointing at a fallback DNS provider and create a Certificate resource to force issuance:

# create fallback ClusterIssuer (snip)
kubectl apply -f fallback-clusterissuer.yaml

# create Certificate
kubectl apply -f emergency-certificate.yaml

# watch events
kubectl describe certificate example-com
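The contents of `fallback-clusterissuer.yaml` are not shown above. As a rough idea of what such a manifest might contain, the sketch below renders one from a shell function. It assumes a Route 53 fallback solver with ambient AWS credentials and the Let's Encrypt production endpoint, neither of which this runbook specifies. The function name `render_fallback_issuer` and the issuer and secret names are hypothetical.

```shell
# Hypothetical generator: print a fallback ClusterIssuer manifest.
# Usage: render_fallback_issuer NAME EMAIL AWS_REGION
render_fallback_issuer() {
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: $1
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: $2
    privateKeySecretRef:
      name: $1-account-key
    solvers:
    - dns01:
        cnameStrategy: Follow   # follow a pre-delegated _acme-challenge CNAME
        route53:
          region: $3
EOF
}
```

You could pipe the result straight into the cluster with `render_fallback_issuer letsencrypt-fallback ops@example.com us-east-1 | kubectl apply -f -`, then reference the issuer by name in the emergency Certificate resource. Review the rendered YAML first if your solver needs explicit credentials.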

Helpful commands

# DNS: trace the delegation
dig +trace example.com

# authoritative TXT check
dig +short TXT _acme-challenge.example.com @ns1.fallback-dns.example

# TLS: check certificate chain
openssl s_client -connect example.com:443 -servername example.com -showcerts < /dev/null

# quick HTTP check
curl -I https://example.com --resolve example.com:443:203.0.113.55 -v

Final recommendations

Outages affecting major CDNs and cloud providers are inevitable; the 2025–2026 incident patterns made that clear. Your ability to recover quickly depends less on a single tool and more on pre-planned redundancy: multiple DNS providers, delegated ACME challenge locations, pre-issued origin certificates, and automated playbooks. Treat emergency DNS‑01 issuance and safe rerouting as a core SRE capability.

Actionable takeaways:

Implement CNAME delegation for _acme-challenge and a fallback DNS provider today.
Keep a publicly trusted origin cert (wildcard or SAN) ready for emergency use.
Script and test your entire chain monthly: issue, install, reroute, verify.

Use this runbook as your incident playbook and iterate it after each drill or outage.

Call to action

Download and embed this runbook in your on-call documentation, run a failover drill this week, and sign up for our incident checklist updates to get pre-built scripts for ACME DNS‑01 issuance across multiple DNS providers. If you need a tailored runbook for your architecture, reach out to the LetsEncrypt.xyz Ops team for a readiness review.
