The Spot Instance That Killed Our Payments Service (And Why It Took Us 47 Minutes to Find It)


DEV Community·Peter·about 1 month ago

#kubernetes #devops #sre #postmortem #node #fullscreen


It started at 1:49 AM. PagerDuty fired: payments-service was entering CrashLoopBackOff, 3 replicas simultaneously. The on-call engineer was paged. I joined the incident bridge 4 minutes later. By 2:36 AM, we had the fix deployed. 47 minutes of debugging for a 2-line YAML change.

This is the postmortem. Not of the incident itself (those exist internally), but of the investigation. Every wrong turn, every wasted minute, and the exact signals that eventually cracked it.

## The First 10 Minutes: The Obvious Wrong Answer

When pods crash simultaneously right after a deployment, the deployment is guilty until proven innocent. That's the right instinct most of the time. So the first 10 minutes were spent here:

    kubectl rollout history deployment/payments-service -n production
    kubectl describe deployment/payments-service -n production

The last deployment had gone out at 7:52 PM, roughly 6 hours earlier. The pods had been healthy the whole time since that deploy.…
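The timeline figures quoted here (page at 1:49 AM, fix at 2:36 AM, last deploy at 7:52 PM) are easy to sanity-check during an incident. The following is a minimal sketch, not part of the original investigation: `minutes_between` is a hypothetical helper, and it assumes zero-padded 24-hour `HH:MM` timestamps that are less than 24 hours apart.

```shell
#!/bin/sh
# Hypothetical helper: minutes elapsed from START to END.
# Assumes zero-padded 24-hour HH:MM input; END may fall after midnight.
minutes_between() (
  start_h=${1%%:*}; start_m=${1##*:}
  end_h=${2%%:*};   end_m=${2##*:}
  # Strip one leading zero so "08" and "09" are not read as invalid octal.
  start=$(( ${start_h#0} * 60 + ${start_m#0} ))
  end=$(( ${end_h#0} * 60 + ${end_m#0} ))
  # Adding a full day before the modulo handles the wrap past midnight.
  echo $(( (end - start + 1440) % 1440 ))
)

# Last deploy (19:52) to first page (01:49).
minutes_between 19:52 01:49   # prints 357 (5 h 57 min)

# First page (01:49) to fix deployed (02:36).
minutes_between 01:49 02:36   # prints 47
```

A gap of several hours between the last rollout and the first crash is the kind of signal that weakens the "blame the deploy" hypothesis, which is why it is worth computing rather than eyeballing at 2 AM.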
