When Retries Turn Hostile — How Control Logic Kills Production Systems

DEV Community·Ken Imoto·about 1 month ago

#three #sre #devops #reliability #retry #retries

"Your retries are killing us."

A service team received this message from a downstream dependency during an outage. The dependency's API was timing out, so naturally the client retried: 3 times, 5 times, 10 times. The client thought it was doing the right thing. From the dependency's perspective, it was running at half capacity because of the outage while receiving several times its normal traffic. Retries were making the outage worse and preventing recovery.

This isn't a fable. In August 2012, Knight Capital's trading system activated legacy code (Power Peg) during a deployment, generating millions of orders over 45 minutes. Orders were never marked as "complete," so the system kept regenerating them. The feedback loop never closed. Structurally, the result was an infinite re-execution loop with the same dynamics as a retry storm. The outcome: $440 million lost and the company effectively bankrupt.

Retries exist to survive failures. But when designed carelessly, retries become the failure.…
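To put rough numbers on the anecdote, here is a minimal Python sketch of how immediate retries multiply load on a degraded dependency. The function names, the traffic figures, and the simplifying assumption that every failed attempt is retried at once with no backoff are all illustrative and not taken from the article.

```python
def attempts_per_request(failure_rate: float, max_retries: int) -> float:
    """Expected attempts per original request when each failure is retried immediately.

    Assumes (hypothetically) that every attempt fails independently with the
    same probability and that the client gives up after max_retries retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


def overload_factor(base_rps: float, capacity_rps: float,
                    failure_rate: float, max_retries: int) -> float:
    """Ratio of offered load to what the dependency can currently serve."""
    offered_rps = base_rps * attempts_per_request(failure_rate, max_retries)
    return offered_rps / capacity_rps


if __name__ == "__main__":
    # Illustrative numbers: 1000 req/s of real traffic, dependency normally
    # handles 2000 req/s but is at half capacity, and every call times out.
    for retries in (3, 5, 10):
        factor = overload_factor(base_rps=1000, capacity_rps=1000,
                                 failure_rate=1.0, max_retries=retries)
        print(f"{retries} retries: {factor:.1f}x current capacity")
```

Under these assumptions, a client that retries 3 times sends 4 attempts for every real request while the dependency is fully failing, so the load grows exactly when capacity has shrunk.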
