This is a dummy post. Replace with a real write-up — the structure below is the template every SEV-1 (deep dive) entry follows.
Impact
None — and that was the whole point. This is a postmortem for an incident that didn't happen: rotating the certificate authority underneath a production service mesh carrying banking traffic, with zero dropped connections.
Background
Our Istio mesh anchors workload identity to AWS Private CA. Certificates are short-lived and rotate automatically — but the CA itself has a lifetime too, and "the CA expires" is not an alert you want to meet unprepared.
What we did
- Staged a new intermediate alongside the old one, so istiod could issue from either.
- Distributed the combined trust bundle first. Every workload must trust both roots before any workload presents a cert from the new one. This ordering is the entire game.
- Flipped issuance to the new intermediate and watched cert age drain down as workloads renewed naturally.
- Removed the old root from the bundle only after the last old-issued cert expired — verified from SPIFFE identities in access logs, not from hope.
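# The dashboard that mattered: how many live certs still chain to the old root.
# Per-pod check behind it: print the issuer of the workload cert Envoy is serving.
# select(.name == "default") skips the ROOTCA entry, which carries no tlsCertificate.
istioctl proxy-config secret "$POD" -o json \
  | jq -r '.dynamicActiveSecrets[] | select(.name == "default") | .secret.tlsCertificate.certificateChain.inlineBytes' \
  | base64 -d | openssl x509 -noout -issuer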
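The per-pod command above answers the question for one workload. A rough sketch of how it might be rolled up across a namespace follows. The function names (`pod_cert_issuer`, `tally_issuers`, `namespace_issuer_tally`) are hypothetical, and it assumes `kubectl`, `istioctl`, `jq`, and `openssl` are on the PATH. It is not the dashboard query we actually ran.

```shell
# Print the issuer of the workload cert a pod's Envoy is currently serving.
# Arguments: pod name, namespace. Assumes the workload secret is named "default".
pod_cert_issuer() {
  istioctl proxy-config secret "$1" -n "$2" -o json \
    | jq -r '.dynamicActiveSecrets[] | select(.name == "default") | .secret.tlsCertificate.certificateChain.inlineBytes' \
    | base64 -d | openssl x509 -noout -issuer
}

# Read issuer lines on stdin and print "count<TAB>issuer", most common first.
tally_issuers() {
  awk '{ seen[$0]++ } END { for (i in seen) printf "%d\t%s\n", seen[i], i }' \
    | sort -t "$(printf '\t')" -k1,1nr -k2
}

# Walk every pod in a namespace and tally the issuers of their live certs.
# Pods without a sidecar print errors on stderr and add nothing to the tally.
namespace_issuer_tally() {
  kubectl get pods -n "$1" -o name | while read -r pod; do
    pod_cert_issuer "${pod#pod/}" "$1"
  done | tally_issuers
}
```

Calling `namespace_issuer_tally payments` (the namespace name is a placeholder) prints one line per distinct issuer. The rotation is done for that namespace when the old intermediate no longer appears.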
What we learned
- Trust distribution and issuance are two separate rollouts. Collapse them into one and workloads still holding the old bundle reject peers that already present new-CA certs, which shows up as a wave of TLS handshake failures.
- Envoy hot-swaps certs cleanly, but long-lived gRPC streams pin their handshake-time identity. Drain them deliberately or they'll surprise you.
- Rehearse in staging with production-shaped traffic. Our first staging run found a gateway that had a root pinned in a ConfigMap nobody remembered.
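The forgotten ConfigMap is the kind of thing a cluster-wide sweep can surface before the rehearsal does. A minimal sketch, assuming `kubectl` and `jq` are available, is below. The function names are hypothetical, and it only inspects the `data` field, so certs stored under `binaryData`, in Secrets, or baked into images will not show up.

```shell
# Read a ConfigMapList as JSON on stdin and print "namespace/name" for every
# ConfigMap whose data contains at least one PEM certificate block.
filter_pem_configmaps() {
  jq -r '.items[]
    | select(any((.data // {})[]; contains("-----BEGIN CERTIFICATE-----")))
    | "\(.metadata.namespace)/\(.metadata.name)"'
}

# Sweep the whole cluster for ConfigMaps that embed certificates.
configmaps_with_pem() {
  kubectl get configmaps --all-namespaces -o json | filter_pem_configmaps
}
```

Every hit still needs a human look: many will be legitimate bundles, and the point is only to find the ones that pin a single root.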
Action items
- Automate bundle-propagation verification as a pre-flight check
- Add cert-chain-age panel to the mesh Grafana board
- Runbook the gRPC drain procedure
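The first action item could start from something like the sketch below: a gate that passes only when a trust bundle contains both root fingerprints. The function names and the fingerprint arguments are placeholders. It assumes `jq` and `openssl` are installed and that the mesh trust bundle is exposed under the `ROOTCA` secret in `istioctl proxy-config secret` output, which is worth confirming against your Istio version.

```shell
# Print the trust bundle (PEM) a pod's Envoy currently holds.
# Arguments: pod name, namespace. Assumes the bundle lives under "ROOTCA".
pod_trust_bundle() {
  istioctl proxy-config secret "$1" -n "$2" -o json \
    | jq -r '.dynamicActiveSecrets[] | select(.name == "ROOTCA") | .secret.validationContext.trustedCa.inlineBytes' \
    | base64 -d
}

# Print the SHA-256 fingerprint of every certificate in a PEM bundle file,
# one per line. openssl x509 reads only the first cert, so split the bundle first.
bundle_fingerprints() {
  fp_tmp="$(mktemp -d)"
  awk -v dir="$fp_tmp" '/-----BEGIN CERTIFICATE-----/ { n++ }
    n { print > (dir "/cert" n ".pem") }' "$1"
  for fp_file in "$fp_tmp"/cert*.pem; do
    [ -e "$fp_file" ] || continue
    openssl x509 -in "$fp_file" -noout -fingerprint -sha256 | cut -d= -f2
  done
  rm -rf "$fp_tmp"
}

# Read fingerprints on stdin; succeed only if both expected roots are present.
# Arguments: old root fingerprint, new root fingerprint.
trusts_both_roots() {
  fp_seen="$(cat)"
  printf '%s\n' "$fp_seen" | grep -qxF -- "$1" \
    && printf '%s\n' "$fp_seen" | grep -qxF -- "$2"
}
```

A pre-flight run would write `pod_trust_bundle` output to a temporary file for each pod, pipe `bundle_fingerprints` into `trusts_both_roots`, and refuse to flip issuance if any pod fails.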