Menu

Post image 1
Post image 2
1 / 2
0

Postmortem: A Kubernetes 1.31 Node Pool Outage Took Down Our App for 4 Hours – Root Cause: Misconfigured Spot Instances

DEV Community·ANKUSH CHOUDHARY JOHAL·about 1 month ago
#b9iBgoWZ
Reading 0:00
15s threshold

On October 17, 2024, our production Kubernetes 1.31 cluster suffered a 4-hour 12-minute total outage affecting 142,000 daily active users, 38 enterprise customers, and $240k in hourly transaction volume—all triggered by a single misconfigured spot instance node pool provisioner. 🔴 Live Ecosystem Stats ⭐ kubernetes/kubernetes — 122,018 stars, 42,991 forks Data pulled live from GitHub and npm. 📡 Hacker News Top Stories Right Now Why does it take so long to release black fan versions? (88 points) Job Postings for Software Engineers Are Rapidly Rising (146 points) Ti-84 Evo (418 points) Spirit Airlines Is Winding Down All Operations (17 points) Ask.com has closed (202 points) Key Insights Kubernetes 1.31's new NodePool API GA (released August 2024) introduces stricter spot instance eviction handling that exposed our legacy provisioner misconfiguration Using Terraform 1.9.5 with the AWS EKS module v20.11.0, we incorrectly set spot_price = "0.10" instead of using the on-demand baseline for our m7g.large…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More