The Spot Instance That Killed Our Payments Service (And Why It Took Us 47 Minutes to Find It)

DEV Community · Peter · about 1 month ago
#kubernetes#devops#sre#postmortem#node#fullscreen

It started at 1:49 AM. PagerDuty fired: payments-service entering CrashLoopBackOff, 3 replicas simultaneously. On-call engineer paged. I joined the incident bridge 4 minutes later. By 2:36 AM, we had the fix deployed. 47 minutes of debugging for a 2-line YAML change.

This is the postmortem. Not of the incident itself (those exist internally) but of the investigation. Every wrong turn, every wasted minute, and the exact signals that eventually cracked it.

## The First 10 Minutes: The Obvious Wrong Answer

When pods crash simultaneously right after a deployment, the deployment is guilty until proven innocent. That's the right instinct most of the time. So the first 10 minutes were spent here:

```bash
kubectl rollout history deployment/payments-service -n production
kubectl describe deployment/payments-service -n production
```

The last deployment had gone out at 7:52 PM, roughly 6 hours before the page. The pods had been healthy the whole time since that deploy.…
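The "guilty until proven innocent" check boils down to one subtraction: how long after the last rollout did the crashes start? Here is a minimal sketch of that reasoning in Python. The function name `deploy_is_suspect` and the 30-minute window are assumptions made for illustration, not something from our tooling, and the calendar dates are placeholders, since only the clock times matter.

```python
from datetime import datetime, timedelta


def deploy_is_suspect(last_deploy: datetime, first_crash: datetime,
                      window: timedelta = timedelta(minutes=30)) -> bool:
    """Return True if the crash began within `window` after the last deploy.

    The 30-minute default is an arbitrary, hypothetical threshold.
    """
    gap = first_crash - last_deploy
    return timedelta(0) <= gap <= window


# Clock times are from the incident; the dates are placeholders.
last_deploy = datetime(2024, 1, 1, 19, 52)  # 7:52 PM
first_crash = datetime(2024, 1, 2, 1, 49)   # 1:49 AM the next day

gap = first_crash - last_deploy
print(gap)                                          # 5:57:00
print(deploy_is_suspect(last_deploy, first_crash))  # False
```

A gap of nearly six hours of healthy pods does not fully clear the deployment, but it pushes it well down the suspect list.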
