Monitoring Certificate Health at Scale: Alerts, Dashboards and CT-Based Detection

2026-02-19

A 2026 playbook for SREs: monitor cert expiry, OCSP stapling, CT entries and issuance anomalies with dashboards, alerts, and runbooks.

Monitoring Certificate Health at Scale: A Social-Platform‑Attack‑Inspired Playbook for Alerts, Dashboards and CT Detection

When social platforms and major CDNs suffer outage spikes or become the target of mass attacks, one common failure mode for dependent services is breakage of certificates or certificate automation. If an expired cert, a missing OCSP staple, or a rogue issuance suddenly knocks your APIs or web apps offline, you need monitoring and runbooks that detect and resolve the failure before customers notice.

This playbook—informed by the outage and attack waves of late 2025 and early 2026—shows how SRE teams can build practical dashboards, alerts, and automated remediation for certificate expiry, OCSP stapling, Certificate Transparency (CT) anomalies, and issuance irregularities. It’s written for developers and ops teams running at scale using Let's Encrypt and ACME-based automation, CDNs, and multi‑cloud load balancers.

Why this matters in 2026

  • Short‑lived certificates (e.g., Let's Encrypt's ~90‑day model) make automation essential, but they also widen the blast radius when that automation is misconfigured.
  • CT log visibility has grown—attackers and defenders both use CT feeds to surface new certificates quickly. Detecting unexpected issuance is now a primary early-warning signal for domain abuse.
  • OCSP stapling and stapled responses matter for availability and compliance; missing staples are a frequent cause of TLS errors during partial outages of OCSP responders or caching layers.
  • Late‑2025 incidents showed mass outages can cascade from PKI or CDN failures; SRE teams need actionable dashboards and runbooks to avoid “alert storms” and to restore service quickly.

High-level monitoring strategy

Use a defense-in-depth approach with three layers:

  1. Passive telemetry — collect what’s already in your stack (load balancer TLS metrics, CDN health, webserver certs).
  2. Active checks — synthetic probes for expiry, OCSP stapling, handshake metrics, and HTTPS connectivity on all frontends and API endpoints.
  3. Outside-in signals — CT feeds and public scan data to detect unanticipated certificates and large-scale CA activity affecting your namespace.

What to monitor (and why)

  • Expiry — primary availability risk; alert early (14/7/2 days) and track auto‑renewal success metrics.
  • OCSP stapling — missing or expired staples cause client validation failures and latency spikes when OCSP responders are slow or unavailable.
  • SCT / CT presence — new certs should appear in CT logs. Unexpected CT entries for your domains can indicate fraudulent issuance.
  • Issuance anomalies — spikes in issuance rate, certificates issued to unusual SANs, or certs signed by unexpected CAs.
  • Handshake errors and TLS versions/ciphers — regressions after automation changes or CA pushes can cause client compatibility issues.
  • ACME automation health — failure rates on renewal jobs, DNS‑01 challenges failing, rate limits reached.

Building the monitoring stack: tools and components

Recommended components for a scalable solution:

  • Prometheus + Alertmanager — for metrics, alerting and rate‑limiting alerts.
  • Grafana — service and incident dashboards, with playbook links embedded.
  • Blackbox exporter / cert_exporter — active cert checks and expiry metrics.
  • OCSP checkers — simple scripts or exporters to validate stapling and OCSP response codes.
  • CT feeds — CertStream (websocket), Google CT API, crt.sh queries for historical lookup.
  • Log aggregator — Splunk/Elastic/Vector to centralize TLS handshake failures, ACME logs and LB errors.
  • Incident automation — PagerDuty, Opsgenie, Slack + playbook orchestration (Runbooks as Code).

Sample metrics and Prometheus export patterns

Key metrics you should be exporting:

  • tls_cert_not_after_seconds{job,instance,domain} — unix timestamp of certificate expiry.
  • tls_ocsp_status{job,instance,domain} — 1 if a valid OCSP staple is present, 0 if the staple is missing or invalid.
  • acme_renewal_last_success_timestamp{job,domain} — when renewal last succeeded.
  • cert_issuance_events_total{domain,issuer} — count of new certificates observed (ingested from CT).
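
As a concrete starting point, here is a minimal sketch of a custom exporter for the expiry metric above, assuming the prometheus_client package and a plain stdlib TLS probe; the domain list and listen port are illustrative, and the job/instance labels are added by Prometheus at scrape time.

import ssl, socket, time
from prometheus_client import Gauge, start_http_server

DOMAINS = ["example.com", "api.example.com"]   # illustrative watch list

# only the `domain` label is set here; job/instance come from the Prometheus scrape config
cert_not_after = Gauge("tls_cert_not_after_seconds",
                       "Unix timestamp of the leaf certificate's notAfter", ["domain"])

def probe_expiry(domain):
    # open a verified TLS connection and record when the presented certificate expires
    ctx = ssl.create_default_context()
    with socket.create_connection((domain, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=domain) as tls:
            cert = tls.getpeercert()
    cert_not_after.labels(domain=domain).set(ssl.cert_time_to_seconds(cert["notAfter"]))

if __name__ == "__main__":
    start_http_server(9115)          # scrape target for Prometheus
    while True:
        for d in DOMAINS:
            try:
                probe_expiry(d)
            except Exception:
                pass                 # keep the last good sample; log the error in a real exporter
        time.sleep(300)

Point a Prometheus scrape job at port 9115 and the PromQL examples below apply unchanged.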

PromQL examples

Expiry alert: fire early and escalate (1209600 seconds = 14 days).

min by (domain) (tls_cert_not_after_seconds - time()) < 1209600

OCSP stapling alert: missing or invalid staple in the last check.

tls_ocsp_status == 0

Issuance surge (detect a burst of new certificates well above the trailing weekly baseline):

increase(cert_issuance_events_total[1h]) > 10
  and increase(cert_issuance_events_total[1h]) > 4 * (increase(cert_issuance_events_total[7d]) / 168)

CT‑based detection: early warning for domain abuse

CT feeds are now a standard early‑warning channel. In 2026 the CT ecosystem processes billions of certificates—subscribe to a real‑time feed and build lightweight detectors that look for:

  • Any new issuance containing your exact domain or high-risk wildcard patterns (e.g., *.yourdomain.example)
  • Certificates issued by unexpected CAs or to unknown SAN combinations
  • Large volumes of certificates for related domains (suggesting automated abuse)

Example: Python CertStream detector

This example listens to CertStream and posts to Alertmanager or Slack when a suspicious cert is observed. (Trimmed for clarity.)

import certstream, requests

WATCH_DOMAINS = {"example.com", "api.example.com"}
ALERTHOOK = "https://alertmanager.example.internal/api/v1/alerts"

def cb(message, context):
    if message['message_type'] != 'certificate_update':
        return
    cert = message['data']['leaf_cert']
    san = cert.get('all_domains', [])
    if WATCH_DOMAINS & set(san):
        payload = [{
          'labels': {'alertname': 'CTCertificateObserved','domain': ','.join(WATCH_DOMAINS & set(san))},
          'annotations': {'summary': 'New certificate seen in CT for watched domain', 'details': str(cert)}
        }]
        requests.post(ALERTHOOK, json=payload)

certstream.listen_for_events(cb, url='wss://certstream.calidog.io/')

Notes:

  • Run this in a horizontally scalable service so spikes in CT volume don't drop events.
  • Enrich alerts with DNS owner emails, ACME actor, and recent issuance rate to reduce false positives.
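
The detector above does an exact-set match, but the first bullet under CT-based detection also calls for wildcard and subdomain coverage. Here is a small sketch of that broader match, using the same illustrative watch list:

WATCH_DOMAINS = {"example.com", "api.example.com"}   # same illustrative list as above

def is_suspicious(san_entries):
    # flag SANs that equal a watched domain, sit beneath one, or are wildcards covering one
    for name in san_entries:
        bare = name[2:] if name.startswith("*.") else name
        for watched in WATCH_DOMAINS:
            if (bare == watched
                    or bare.endswith("." + watched)     # subdomain of a watched zone
                    or watched.endswith("." + bare)):   # wildcard/base cert covering a watched host
                return True
    return False

# e.g. a wildcard for the apex covers the watched API host
assert is_suspicious(["*.example.com"])
assert not is_suspicious(["example.org"])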

OCSP stapling checks and remediation

OCSP stapling problems can silently degrade users' ability to validate certificates—especially mobile clients that enforce stapling more strictly. Monitor both staple presence and the OCSP response validity period.

Active OCSP check (bash)

# query a server's stapled OCSP response
openssl s_client -connect api.example.com:443 -status </dev/null 2>&1 | awk '/OCSP Response/{print; exit}'

Automate this into a script that checks the OCSP response status is "good", extracts the response's Next Update time, and alerts when the stapled response will expire in less than X hours.
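
A minimal sketch of that script, assuming the openssl CLI is on the path; the default host and the 6‑hour threshold are illustrative.

import re, subprocess, sys
from datetime import datetime, timezone, timedelta

def stapled_ocsp_next_update(host, port=443):
    # return the stapled OCSP response's Next Update time, or None when no good staple is present
    out = subprocess.run(
        ["openssl", "s_client", "-connect", f"{host}:{port}", "-status"],
        input=b"", capture_output=True, timeout=15,
    ).stdout.decode(errors="replace")
    if "OCSP Response Status: successful" not in out or "Cert Status: good" not in out:
        return None
    m = re.search(r"Next Update:\s*(.+)", out)
    if not m:
        return None
    # openssl prints e.g. "Feb 19 12:00:00 2026 GMT"
    return datetime.strptime(m.group(1).strip(), "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)

if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "api.example.com"
    nu = stapled_ocsp_next_update(host)
    if nu is None:
        print(f"ALERT: {host} has no valid OCSP staple"); sys.exit(2)
    if nu - datetime.now(timezone.utc) < timedelta(hours=6):
        print(f"ALERT: staple on {host} expires soon ({nu.isoformat()})"); sys.exit(1)
    print(f"OK: staple on {host} valid until {nu.isoformat()}")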

Remediation runbook for missing OCSP staple

  1. Confirm server presents the correct certificate chain (openssl s_client -showcerts).
  2. Check the webserver config: enable or re‑enable stapling (Nginx: ssl_stapling on and ssl_stapling_verify on; Apache: SSLUseStapling on plus an SSLStaplingCache).
  3. If behind a load balancer or CDN, verify that the load balancer is requesting and caching OCSP responses correctly. Some managed CDNs have separate OCSP settings.
  4. Restart TLS worker processes gracefully and watch the metrics; the staple should reappear within minutes.
  5. If the CA's OCSP responder is down or slow, and only where policy allows, temporarily relax strict OCSP client checks in a phased manner, then re‑enable them once the responder is stable.

Issuance anomaly playbook

When CT or internal telemetry shows unexpected issuance:

  1. Correlate the certificate by SAN and issuer with internal ACME logs, DNS provider logs, and IAM actions during the same time window (a crt.sh lookup sketch follows this list).
  2. If the cert is unauthorized, revoke it immediately (CA portal or ACME revoke) and report the mis‑issuance to the issuing CA's security or problem‑reporting contact.
  3. Search for related DNS API keys or ACME account keys exposure — rotate keys and secrets that could have been used to issue certificates.
  4. Notify legal/security and apply takedown to TLS endpoints (e.g., firewall/ACL rules to block malicious hosts presenting the cert).
  5. Apply CT monitor filters and automated blocks: temporarily reject TLS sessions from IPs serving suspicious certs until validated.
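
For step 1, here is a hedged sketch of pulling recent issuance history from crt.sh (already listed above as a historical‑lookup source); the endpoint and field names reflect crt.sh's public JSON interface, so verify them against your own queries before relying on them.

import requests
from datetime import datetime, timedelta, timezone

def recent_issuances(domain, days=7):
    # use "%.example.com" as the query value to include subdomains
    resp = requests.get("https://crt.sh/", params={"q": domain, "output": "json"}, timeout=30)
    resp.raise_for_status()
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    recent = []
    for entry in resp.json():
        # entry_timestamp is when the certificate was logged, e.g. "2026-02-19T00:31:52.577"
        logged = datetime.fromisoformat(entry["entry_timestamp"]).replace(tzinfo=timezone.utc)
        if logged >= cutoff:
            recent.append({
                "id": entry["id"],
                "issuer": entry["issuer_name"],
                "names": entry["name_value"].split("\n"),
                "logged_at": logged.isoformat(),
            })
    return recent

if __name__ == "__main__":
    for cert in recent_issuances("example.com"):
        print(cert["logged_at"], cert["issuer"], cert["names"])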

ACME automation health checks and runbooks

Automation is both the solution and a risk. Add these standard checks to your monitoring and runbooks:

  • Renewal job success rate (alert if >5% failures in a day; see the sketch after this list).
  • DNS challenge failures by provider (rate limit and auth error classification).
  • ACME rate limit warnings (use ACME challenge and account quotas to preemptively throttle renewals).
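
A sketch of that first check, querying Prometheus's HTTP API; the endpoint is illustrative, and acme_renewal_attempts_total / acme_renewal_failures_total are assumed counters emitted by your renewal jobs alongside the metrics listed earlier.

import requests

PROM = "http://prometheus.example.internal:9090"   # illustrative Prometheus endpoint

def renewal_failure_rate():
    # fraction of renewal attempts that failed over the last 24 hours
    query = ("sum(increase(acme_renewal_failures_total[1d])) / "
             "sum(increase(acme_renewal_attempts_total[1d]))")
    r = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    rate = renewal_failure_rate()
    print(f"renewal failure rate (24h): {rate:.1%}")
    if rate > 0.05:   # the 5% threshold from the checklist above
        print("ALERT: renewal failure rate above 5%")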

Quick remediation steps when renewals fail

  1. Check ACME logs for error codes (e.g., "challenge not found", "rateLimited").
  2. If dns‑01 is failing: validate API keys, check DNS propagation and TTLs, and create the TXT record manually to reproduce (a propagation‑check sketch follows this list).
  3. If http‑01 failing: verify target path is reachable, check reverse proxies and caching rules, and ensure ACME challenge responses aren’t being blocked by WAF rules.
  4. Failover to backup cert: use a pre‑staged cert on load balancer/CDN if automated renewal cannot be fixed within the SLA window.
  5. Perform a controlled renewal with a different ACME client or alternate ACME endpoint (staging) to test fix before production.
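
A minimal sketch of the dns‑01 propagation check from step 2, assuming the dnspython package; the domain and token are illustrative, and the real token comes from your ACME client's logs.

import dns.resolver   # pip install dnspython

def acme_txt_present(domain, expected_token):
    # check whether the _acme-challenge TXT record is visible to public resolvers
    name = f"_acme-challenge.{domain}"
    try:
        answers = dns.resolver.resolve(name, "TXT")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    values = {b"".join(r.strings).decode() for r in answers}
    return expected_token in values

if __name__ == "__main__":
    print(acme_txt_present("example.com", "token-from-acme-client-logs"))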

Dashboards: what a single pane of glass should show

Build an incident dashboard that self-serves SREs and on‑call:

  • Top‑level health: percent of frontends with valid certs (green/yellow/red)
  • Expiry timeline: upcoming expirations (14/7/2 days) by service and owner
  • OCSP status: count of failed staples and their endpoints
  • CT alerts: recent suspicious CT hits with links to certs and runbook steps
  • ACME automation: last successful run, failure rate, and error categories
  • Incident links: one-click runbook, PagerDuty play, and Slack channel join

Dashboard best practices

  • Embed runbooks and remediation commands per panel so on‑call can act without hunting for documents.
  • Use templated variables for services and domains to keep dashboards compact and searchable.
  • Include historical baselines (last 30/90 days) to detect issuance surges vs normal activity.

Alerting policy: avoid alert fatigue but act fast

Create multi‑tier alerts:

  • Informational — expiry >14 days, low‑severity ACME warnings (channel: email)
  • Action — expiry <7 days, OCSP missing, renewal job failures (channel: Slack + runbook link)
  • Urgent — expiry <48 hours with failed renewal, or a rogue CT‑observed issuance not yet revoked (channel: PagerDuty)

Automation and remediation examples

Automate safe remediation steps; prefer idempotent actions and require manual approval for destructive remediation.

Auto‑renew helper (bash + ACME client)

#!/bin/bash
# quick helper to re-run ACME renewal and reload the LB
certbot renew --deploy-hook "/usr/local/bin/reload-lb.sh" --quiet || (
  echo "Renewal failed, creating PagerDuty incident"
  # double quotes so $HOSTNAME expands; pd.example is a placeholder endpoint
  curl -XPOST https://pd.example/api/incidents -d "{\"summary\":\"cert renewal failed: $HOSTNAME\"}"
)

Prefer dedicated orchestration tools (e.g., Ansible, Terraform) for larger fleets, and record remediation actions in the incident timeline automatically.

Testing & chaos: prove your monitoring

Regularly run controlled experiments:

  • Rotate a test cert to an expired state in staging to validate expiry alerts and runbook timing (a cert‑minting sketch follows this list).
  • Simulate OCSP responder degradation and verify alerts and failover behavior.
  • Inject a synthetic CT entry for a test domain and ensure the CT detector fires and follows the runbook.
  • Runbook rehearsals quarterly—require an on‑call to perform the documented remediation.
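
For the first experiment, here is a sketch of minting an already‑expired self‑signed certificate for a staging endpoint, assuming the cryptography package; the hostname is illustrative.

import datetime
from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa

key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, "expired.staging.example.com")])
now = datetime.datetime.now(datetime.timezone.utc)

cert = (
    x509.CertificateBuilder()
    .subject_name(name)
    .issuer_name(name)                                     # self-signed
    .public_key(key.public_key())
    .serial_number(x509.random_serial_number())
    .not_valid_before(now - datetime.timedelta(days=90))
    .not_valid_after(now - datetime.timedelta(days=1))     # already expired
    .add_extension(
        x509.SubjectAlternativeName([x509.DNSName("expired.staging.example.com")]),
        critical=False,
    )
    .sign(key, hashes.SHA256())
)

with open("expired.crt", "wb") as f:
    f.write(cert.public_bytes(serialization.Encoding.PEM))
with open("expired.key", "wb") as f:
    f.write(key.private_bytes(
        serialization.Encoding.PEM,
        serialization.PrivateFormat.TraditionalOpenSSL,
        serialization.NoEncryption(),
    ))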

Troubleshooting matrix: quick reference

Use this condensed table as a playbook lookup during incidents (reformat into your tool):

  • Symptom: 503 errors after CDN change — Check: certificate chain at CDN, OCSP staple, CDN config; Action: roll back CDN TLS setting
  • Symptom: Renewal failures — Check: ACME logs, DNS provider API errors, rate limits; Action: rotate API key, trigger manual renewal
  • Symptom: Unexpected CT cert — Check: issuance metadata, internal ACME activity; Action: revoke if malicious + rotate keys

Forward‑looking practices

  • Shift left on CT monitoring: integrate CT alerts into CI for new service registrations so teams are notified immediately about any unauthorized certificates issued for a new domain.
  • Adopt runbooks as code (stored with service repos) so runbooks version with the service and can be executed by automation during incidents.
  • Watch for privacy‑preserving certificate issuance options (late‑2025 pilots) that may reduce CT visibility; keep internal telemetry as a backstop if CT coverage shrinks.
  • Use short‑lifecycle certs but harden your automation: rotation frequency is a feature, not a burden; instrument it aggressively.

Case study (composite): Detecting mass issuance during a platform outage

In December 2025, several large platforms experienced partial outages and a surge of automated account‑compromise activity. An SRE team using the playbook above detected an issuance spike for API subdomains via CertStream: 120 new certs in one hour. Their automation:

  1. Auto‑escalated a PagerDuty incident (CT anomaly + issuance rate > threshold).
  2. Correlated the certificates to an expired API gateway token in IAM logs.
  3. Revoked suspicious certs, rotated the exposed token, and used a staged DNS roll to quarantine affected endpoints.
  4. Post‑incident, they added a pre‑commit hook to create an audit entry whenever a new ACME account is registered for the org.

Actionable takeaways

  • Implement three layers of monitoring: passive, active, and outside‑in CT feeds.
  • Alert with graduated severity: informational (14d), action (7d) and urgent (48h + failed renewal).
  • Automate safe remediation paths but require manual approval for high‑risk actions like mass revocation.
  • Embed runbooks in dashboards and practice them regularly; chaos test for certificate failures.
  • Incorporate CT monitoring for early detection of domain abuse—subscribe to real‑time feeds and integrate with your alerting pipeline.

Monitoring certificate health isn’t just about catching expired certs — it’s about detecting automation failure, PKI abuse, and external service outages early enough to stop customer impact.

Final checklist before you go live

  • Are expiry alerts firing at 14/7/2 days with owner info?
  • Do you have OCSP staple validity metrics and an automated remediation path?
  • Is CertStream (or equivalent) feeding CT events into Alertmanager and your SIEM?
  • Are ACME automation failures visible and classified by error type?
  • Do runbooks live with services, are they executable, and have they been rehearsed in the last 90 days?

Call to action

If you’re responsible for large fleets of certificates, start by adding CT listening and an expiry panel to your incident dashboard this week. Instrument the three metric types described here and run one renewal failure drill in staging. If you want a starter kit—Prometheus rules, Grafana dashboard JSON, and a CertStream detector template—grab our open‑source repo on GitHub and adapt the playbook to your stack.

Get the starter kit: clone, deploy the certstream detector, wire Alertmanager, and run a staged renewal failure. If you need help architecting this for your environment, our team at letsencrypt.xyz offers consultation and runbook templates tailored to multi‑cloud and CDN setups.
