When CoreDNS Falls Silent: A Kubernetes DNS Disaster Story & The Playbook That Saved Us


DEV Community·Akshat Sinha·20 days ago


#chapter #kubernetes #devops #coredns #fullscreen #cluster


A real-world incident narrative + definitive best practices for CoreDNS at scale

Prologue: The Calm Before the Storm

The cluster was healthy. 312 pods spread across 24 nodes. CoreDNS: two replicas, default settings, humming along since the cluster was provisioned eighteen months ago. Nobody had touched it. Nobody needed to touch it.

Until the Wednesday nobody expected.

Chapter 1 The Incident: "Why Is Payment Timing Out?"

It started with a Slack ping at 11:42 PM.

    @oncall-alert [CRITICAL] Payment service unreachable: circuit breaker open on checkout-gateway

I SSH'd into the jump box. First instinct: kubectl get pods.

    $ kubectl get pods -n production | grep payment
    payment-svc-8d4f6b7c-x2k9m      1/1   Running   0   45d
    order-processor-6c8d9f4-x7q2w   1/1   Running   0   45d

All pods running. All healthy.…
