A real-world incident narrative + definitive best practices for CoreDNS at scale

Prologue: The Calm Before the Storm

The cluster was healthy. 312 pods spread across 24 nodes. CoreDNS: two replicas, default settings, humming along since the cluster was provisioned eighteen months ago. Nobody had touched it. Nobody needed to touch it.

Until the Wednesday nobody expected.

Chapter 1. The Incident: "Why Is Payment Timing Out?"

It started with a Slack ping at 11:42 PM.

    @oncall-alert [CRITICAL] Payment service unreachable: circuit breaker open on checkout-gateway

I SSH'd into the jump box. First instinct: kubectl get pods.

    $ kubectl get pods -n production | grep payment
    payment-svc-8d4f6b7c-x2k9m      1/1   Running   0   45d
    order-processor-6c8d9f4-x7q2w   1/1   Running   0   45d

All pods running. All healthy.…