Postmortem: How a Karpenter 1.0 Node Provisioning Bug Caused a 1-Hour Outage

1 / 2

Postmortem: How a Karpenter 1.0 Node Provisioning Bug Caused a 1-Hour Outage

DEV Community·ANKUSH CHOUDHARY JOHAL·about 1 month ago

#pGxAUZA2

#tip #postmortem #karpenter #node #provisioning #topology

Reading 0:00

15s threshold

On October 17, 2024, at 09:42 UTC, a single unhandled edge case in Karpenter 1.0’s node provisioning logic took down 83% of our production Kubernetes workloads for 61 minutes, costing an estimated $142k in lost revenue, 12 SLA penalty fees from enterprise customers, and 3 escalated support tickets from Fortune 500 clients. The root cause wasn’t a missing IAM permission, a misconfigured AWS VPC, or a cloud provider outage—it was a 12-line regression in Karpenter’s core scheduling package that slipped past 142 unit tests, 18 integration suite runs, and our 2-week canary rollout to 5% of production nodes. This postmortem details exactly what went wrong, how we fixed it, and the benchmarks and code you need to avoid the same mistake.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Create free account Log in

Menu

Postmortem: How a Karpenter 1.0 Node Provisioning Bug Caused a 1-Hour Outage