When Certs Fail During Major Outages: Real-World Postmortems and Metrics to Track
A practical guide for SREs: observability metrics and a postmortem template to diagnose and prevent certificate failures during major outages.
Nothing breaks trust faster than a certificate error in the middle of a large outage. As SREs and platform engineers we plan for capacity, we automate renewals, and we instrument services — yet certificate failures still cause painful user-facing TLS errors, extended recovery windows, and messy postmortems. This article gives a compact, battle-tested set of observability metrics and a focused postmortem template you can copy into your incident process to turn certificate failures into repeatable learning.
In brief — the most important actions first
- Track both issuance and runtime signals. Certificates can look fine in inventory but fail at handshake-time (OCSP/chain/handshake errors).
- Automate detection and alerts with clear thresholds. Don’t wait for user reports — alert on TLS handshake failures and ACME challenge errors.
- Use a tailored postmortem template for cert incidents. Certificate failures need different artifacts (CT/OCSP stubs, ACME logs, DNS change windows).
Context: why certificate failures amplify major outages (2026 lens)
Outages that began as network or edge failures often cascade into certificate problems. Examples in early 2026 showed large spikes in outage reports where multiple providers (X, Cloudflare, AWS) experienced correlated symptoms: routing abnormalities, DNS anomalies, and transient control-plane errors that impacted automated certificate issuance, OCSP stapling, or edge key access. When the control plane that handles ACME or managed-cert operations is degraded, renewals and staple refreshes can fail at scale — creating a second, highly-visible outage for end users.
Two trends in 2025–2026 make observability essential:
- Wider adoption of short-lived certificates (automated 7–30 day lifetimes) reduces blast radius from an expiry, but increases dependency on continuous issuance telemetry.
- More multi-layered TLS stacks (edge CDNs, cloud-managed certs, service mesh mTLS) create multiple failure surfaces: CDN staples, origin certs, and intermediate chain changes can each cause client failures.
What to measure: the essential observability metrics
Group metrics into acquisition (issuance), validation (OCSP/CT/chain), runtime (handshake results), and impact slices (user errors, traffic loss). For each metric I include example names, a short PromQL or pseudo-query, and suggested alert thresholds tuned for production.
1. Issuance & renewal metrics
- acme_renewals_total — count of renewal attempts per issuer
rate(acme_renewals_total[5m])
alert: spike in renewals > 3x baseline for 10m (could indicate automated retries or mass expiry)
- acme_renewal_failures_total — ACME errors, broken down by error_type (rate-limit, auth-failure, challenge-failure)
sum by (error_type) (rate(acme_renewal_failures_total[5m]))
alert: any non-zero challenge-failure or auth-failure sustained for 5m
- acme_issuance_latency_seconds — histogram of issuance times
histogram_quantile(0.95, sum(rate(acme_issuance_latency_seconds_bucket[5m])) by (le))
alert: 95th percentile > 30s for ACME issuance during normal ops; >120s during outage is critical (a hedged rule sketch for these issuance metrics follows this list)
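Taken together, these issuance signals map almost directly onto a Prometheus rule group. The following is a minimal sketch, assuming the metric and label names used above (acme_renewal_failures_total with an error_type label, acme_issuance_latency_seconds_bucket) and standard Prometheus rule-file syntax; treat the thresholds as starting points, not SLOs.
# Hedged sketch of a Prometheus rule group for ACME issuance telemetry.
# Assumes the counters/histograms named above exist; adjust names, labels, and thresholds.
groups:
  - name: acme-issuance
    rules:
      - alert: AcmeChallengeOrAuthFailures
        # Any sustained challenge or auth failure during renewal is actionable.
        expr: sum by (error_type) (rate(acme_renewal_failures_total{error_type=~"challenge-failure|auth-failure"}[5m])) > 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "ACME {{ $labels.error_type }} errors are non-zero"
      - alert: AcmeIssuanceLatencyHigh
        # p95 issuance latency above 30s under normal operations.
        expr: histogram_quantile(0.95, sum by (le) (rate(acme_issuance_latency_seconds_bucket[5m]))) > 30
        for: 10m
        labels:
          severity: warn
        annotations:
          summary: "ACME issuance p95 latency is above 30s"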
2. Validation & supply-chain metrics
- ocsp_stapling_success_rate — percent of TLS handshakes that included a valid stapled OCSP response
1 - (sum(rate(ocsp_stapling_failures[5m])) / sum(rate(tls_handshakes_total[5m])))
alert: stapling success < 95% for 5m
- ct_submission_errors — failed Certificate Transparency submissions
rate(ct_submission_errors[5m])
alert: any sustained CT error (>0 for 5m) — SCT-enforcing browsers and clients may reject or distrust the affected certificates
- chain_mismatch_count — server presented a chain that didn't match inventory (intermediate missing or wrong order)
sum(chain_mismatch_count)
alert: >0 for 1m on prod edge (a recording-rule sketch for the stapling ratio follows this list)
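If the stapling ratio feeds both dashboards and alerts, a recording rule keeps the two consistent. A minimal sketch, assuming ocsp_stapling_failures and tls_handshakes_total are counters as described above (the names are illustrative):
# Hedged sketch: record the stapling success ratio once, alert on the recorded series.
groups:
  - name: tls-validation
    rules:
      - record: job:ocsp_stapling_success_ratio:rate5m
        # 1 minus the share of handshakes whose stapled OCSP response was missing or invalid.
        expr: 1 - (sum(rate(ocsp_stapling_failures[5m])) / sum(rate(tls_handshakes_total[5m])))
      - alert: OcspStaplingSuccessLow
        expr: job:ocsp_stapling_success_ratio:rate5m < 0.95
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "OCSP stapling success ratio is below 95%"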
3. Runtime TLS / handshake metrics
- tls_handshake_failures_total — total TLS handshake failures
rate(tls_handshake_failures_total[1m])
alert: increase >50% from baseline sustained for 5m, or absolute >1000/min for large fleets
- tls_cert_error_count{error="expired|unknown_ca|revoked|name_mismatch"} — broken down by client-visible error
sum by (error) (rate(tls_cert_error_count[5m]))
alert: any client-visible error > 0.1% of requests for 10m
- tls_retry_ratio — fraction of requests that required TCP/TLS retries due to cert failures
rate(tls_retries_total[5m]) / rate(requests_total[5m])
alert: ratio > 1% for public-facing services (a ratio-based rule sketch follows this list)
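The 0.1% threshold above is easiest to express as a ratio against total request rate. A hedged sketch, assuming tls_cert_error_count and requests_total have the shapes described in this list:
- alert: ClientVisibleTlsErrorRatioHigh
  # Share of requests that surfaced a client-visible certificate error.
  expr: sum(rate(tls_cert_error_count[5m])) / sum(rate(requests_total[5m])) > 0.001
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Client-visible TLS certificate errors exceed 0.1% of requests"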
4. DNS and challenge-reliant signals
- dns_propagation_ms — measured start-to-success for DNS TXT propagation used by DNS-01 challenges
histogram_quantile(0.95, sum by (le) (rate(dns_propagation_ms_bucket[5m])))
alert: 95p > 30s for internal authoritative zones, >180s for external providers
- http_challenge_404_rate — ACME HTTP-01 challenge 404 hits
rate(http_challenge_404_total[5m])
alert: any non-zero hits during an issuance window (a propagation-check sketch follows this list)
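Outside the metrics pipeline, a small polling loop approximates dns_propagation_ms by hand during an incident. A sketch using dig; the record name and resolver list are placeholders:
#!/usr/bin/env bash
# Poll a set of resolvers until the ACME TXT record is visible, printing elapsed seconds per resolver.
record="_acme-challenge.example.com"   # placeholder challenge record
resolvers=("8.8.8.8" "1.1.1.1" "9.9.9.9")
start=$(date +%s)
for r in "${resolvers[@]}"; do
  until [ -n "$(dig +short TXT "$record" @"$r")" ]; do
    sleep 2
  done
  echo "$r answered after $(( $(date +%s) - start ))s"
done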
5. Impact & client-side telemetry
- client_tls_error_rate — browser-level TLS error reports (RUM) or mobile SDK reports
rate(client_tls_error_events_total[5m])
alert: >0.5% of user sessions
- traffic_drop_percent — change in traffic vs baseline
1 - (sum(rate(requests_total[5m])) / baseline_requests_per_5m)
alert: any drop >10% concurrent with a cert error spike (a baseline-comparison rule sketch follows this list)
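The baseline_requests_per_5m placeholder above can be approximated in plain PromQL with an offset comparison against the same window a day earlier. A sketch, assuming requests_total is your top-level request counter:
- alert: TrafficDropDuringCertErrors
  # Traffic more than 10% below the same window yesterday while cert errors are present.
  expr: |
    (sum(rate(requests_total[5m])) / sum(rate(requests_total[5m] offset 1d)) < 0.9)
    and on() (sum(rate(tls_cert_error_count[5m])) > 0)
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Traffic drop of more than 10% vs yesterday coinciding with TLS certificate errors"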
How to build dashboards and alerts (practical recipes)
Visualize the above metrics in grouped panels: issuance health, runtime TLS health, OCSP/CT status, DNS/challenge health, and client impact. Example panels:
- Issuance Health — line chart of acme_renewal_failures_total by error_type, alongside total renewal attempts
- Handshake Failures Heatmap — TLS errors by region and edge POP
- OCSP Stapling Success — ratio over time with annotated CT-submission events
- Client Errors — RUM/SDK reports showing affected browsers/OS versions
Sample Prometheus alerting rule (pseudo-YAML; notifications are routed via Alertmanager):
- alert: CertificateHandshakeErrorSpike
  expr: increase(tls_cert_error_count[10m]) > 100
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "TLS certificate errors spiked for {{ $labels.service }}"
    runbook: "https://runbooks.example.com/tls-handshake-failure"
Troubleshooting quick wins (commands and checks)
When a cert-related incident fires, collect these artifacts immediately — they speed RCA and cross-team coordination (a batch expiry-check sketch follows the list).
- Fetch the presented chain from multiple edge POPs:
openssl s_client -connect example.com:443 -servername example.com -showcerts
- Check OCSP stapling and stapled response validity:
openssl s_client -connect example.com:443 -status -servername example.com
- Query CT logs for recent submissions (CertStream exposes a websocket firehose; crt.sh offers a simple HTTP query API):
curl -s "https://crt.sh/?q=example.com&output=json" | jq '.[0:5]'
- Review ACME logs at the issuer and client-side: filter for challenge-failure, rate-limit, and auth errors.
grep -i "challenge" /var/log/acme-client.log | tail -n 200
- Validate DNS for DNS-01 challenges and authoritative responses across resolvers:
dig +trace TXT _acme-challenge.example.com
dig @8.8.8.8 TXT _acme-challenge.example.com
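To sweep many endpoints at once, a short loop over openssl gives a quick expiry and issuer snapshot across the fleet. A sketch; the host list is a placeholder for your own edge and origin endpoints:
#!/usr/bin/env bash
# Print subject, issuer, and notAfter for the leaf certificate presented by each endpoint.
hosts=("example.com:443" "api.example.com:443" "edge-pop-1.example.net:443")  # placeholders
for h in "${hosts[@]}"; do
  name="${h%%:*}"
  echo "== $h =="
  echo | openssl s_client -connect "$h" -servername "$name" 2>/dev/null \
    | openssl x509 -noout -subject -issuer -enddate
done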
Postmortem template: certificate failure during a large outage
Paste this into your incident tracker. Keep it concise, evidence-first, and tie every claim to a metric.
Executive summary (1–3 sentences)
Describe what happened, impact, timelines, and user-facing symptoms. Example: "On 2026-01-16 10:28–12:10 UTC we observed a spike in TLS handshake failures for the public API, affecting 18% of incoming requests and causing 45% traffic loss to edge POPs in North America. Root cause: OCSP stapling refreshes failed due to control-plane DNS anomalies during a concurrent provider outage."
Timeline (UTC) — ordered, with evidence links
- 10:27 — monitoring alert: tls_cert_error_rate > 0.5% (link to panel)
- 10:29 — spike in acme_renewal_failures_total (link to ACME logs)
- 10:32 — first customer report of ERR_CERT_COMMON_NAME_INVALID (link to RUM trace)
- 10:45 — mitigation: reroute traffic to fallback cert fleet; OCSP stapling re-enabled; traffic partially recovered
- 11:20 — full recovery as CT submissions cleared
Impact
- Systems affected: public API, web login, static CDN assets via edge
- Users affected: 18% of requests; ~23% of unique sessions reported TLS error
- Business impact: login failures, degraded conversion for 1.5 hours
Root cause and contributing factors
List primary cause and systemic contributors. Include logs and metric snapshots.
- Primary: OCSP stapling service failed to refresh due to upstream DNS control-plane instability at the CDN provider, which prevented fetching OCSP responder endpoints for a subset of intermediates.
- Contributing:
- The edge had no fallback to cached stapled responses older than 1 hour, so stale-but-valid staples were unavailable.
- ACME client mass retry behavior hit rate limits with Let's Encrypt during issuance attempts for short-lived certs — consider automated backoffs and burst controls (see rate/patching playbooks).
- No automated test exercised the full chain (including OCSP stapling) during controlled failover drills.
Remediation & short-term mitigations
- Rollback: Re-enable cached stapled responses and route traffic to cert-fallback pool.
- Hotfix: Increase ACME client retry backoff to reduce burst rate-limiting.
- Communication: Notify customers via the status page and post an incident update within 60 minutes.
Long-term actions (owners and deadlines)
- Implement multi-issuer strategy and automated cross-checks (Owner: platform infra; Due: 4 weeks)
- Add OCSP/CT synthetic monitors across all edge POPs (Owner: observability; Due: 2 weeks)
- Runbook: Create a dedicated TLS incident playbook with commands and dashboards (Owner: SRE; Due: 1 week)
- Drill: Quarterly simulated edge outage with cert/OCSP failure scenarios (Owner: SRE lead; Due: next quarter)
Validation of fixes
Attach pre/post graphs for the metrics listed earlier. Include synthetic test outputs that confirm OCSP stapling, CT submissions, and ACME issuance success rates meeting SLIs. Consider adding automated summarization to incident notes to speed RCA handoff (use AI summarizers for timeline synthesis where appropriate — see agent summarization workflows).
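For the synthetic evidence itself, a whole-chain probe can be as simple as checking verification, staple presence, and days-to-expiry in one pass. A hedged sketch using openssl and GNU date; the target host and thresholds are placeholders:
#!/usr/bin/env bash
# Synthetic whole-chain check: verification result, OCSP staple presence, and expiry margin.
host="example.com"; port=443   # placeholder target
min_days=14                    # placeholder expiry margin
out=$(echo | openssl s_client -connect "$host:$port" -servername "$host" -status 2>/dev/null)
echo "$out" | grep -q "Verify return code: 0 (ok)" || { echo "FAIL: chain verification"; exit 1; }
echo "$out" | grep -q "OCSP Response Status: successful" || echo "WARN: no valid OCSP staple"
end=$(echo "$out" | openssl x509 -noout -enddate | cut -d= -f2)
days=$(( ($(date -d "$end" +%s) - $(date +%s)) / 86400 ))   # requires GNU date
[ "$days" -ge "$min_days" ] || { echo "FAIL: certificate expires in $days days"; exit 1; }
echo "OK: chain verified, staple checked, $days days to expiry"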
Appendices
- Raw logs: ACME client logs, CDN control plane logs, OCSP responder traces
- Configuration diffs: cert config for edge, ACME clients
- Relevant tickets and provider status pages
Use the timeline and metrics as your single source of truth — annotate everything with links to panels and raw logs.
Real-world examples and lessons learned (using the Jan 2026 spike)
In mid-January 2026 multiple providers reported correlated outages and a spike of user-facing errors. Teams that did well shared these traits:
- They had real-time CT and OCSP monitors that correlated certificate failures with provider control-plane degradations.
- They used fallback cert pools (pre-warmed cert bundles) so that the edge could serve a valid chain even when live stapling failed.
- They kept ACME issuance backoff policies in place to prevent stampedes against Let's Encrypt or other public issuers.
Teams that struggled typically lacked instrumentation for runtime TLS and treated certificates only as inventory objects. The result: renewals looked green but handshakes failed.
Advanced strategies and 2026 predictions
Looking at trends through 2026, here are advanced strategies to adopt now:
- Multi-issuer redundancy. Maintain an internal CA or backup issuer to failover when public issuers are rate-limited or unreachable (see migration playbooks and provider-change guidance).
- Edge-level cached staples and graceful degradation. Configure CDNs to serve cached OCSP responses and to accept stale staples for bounded windows during control-plane issues; ensure your edge plays well with edge failover strategies.
- Telemetry standardization for TLS. Adopt a common schema for tls_* metrics, OCSP/CT events, and ACME logs so SIEMs and ML detectors can surface anomalies faster.
- Synthetic whole-chain testing. In 2026 many observability vendors now include CT/OCSP monitors; build daily synthetic tests that validate the full handshake from multiple client types and geographies (and capture artifacts as part of your evidence capture workflow).
Checklist for immediate readiness (copy into your incident runbook)
- Do you have tls_cert_error_rate, ocsp_stapling_success_rate, and acme_renewal_failures in your dashboards? If not, add them now.
- Do runbooks include OpenSSL commands and where to find ACME logs on disks and in cloud logs?
- Does your incident template include CT/OCSP artifacts and an owner for provider communication?
- Do you have a pre-warmed fallback cert pool? If not, schedule to create one.
Actionable takeaways
- Instrument issuance and runtime separately. Both matter — one shows “inventory” health, the other shows user impact.
- Alert on client-visible TLS errors first. They are the most direct indicator of customer experience degradation.
- Include CT and OCSP checks in postmortems. These are often the missing forensic artifacts.
- Practice failovers and drills. Run synthetic tests that simulate provider outages and verify your fallback cert strategy (and consider automated summarization integration to speed RCA handoffs).
Closing: build observability into your certificate lifecycle
Certificate failures during major outages are not a single-system problem — they are a cross-cutting concern that spans DNS, ACME, edge control planes, and client validation. In 2026 the best teams treat TLS like any critical distributed system: instrument it aggressively, automate renewals conservatively (avoid stampedes), and have a playbook that includes CT/OCSP evidence and issuer fallback plans.
Start now: add the listed metrics to your monitoring, copy the postmortem template into your incident process, and run a certificate-failure drill before the next provider spike.
Related Reading
- Design a Certificate Recovery Plan for Students When Social Logins Fail
- Operational Playbook: Evidence Capture and Preservation at Edge Networks (2026)
- Automating Virtual Patching: Integrating 0patch-like Solutions into CI/CD and Cloud Ops
- Edge Migrations in 2026: Architecting Low-Latency MongoDB Regions with Mongoose.Cloud