I had idempotency, a message queue, and retries. I thought I was finally building something production-ready. Then I started thinking about the "in-between" failures.

What if the worker starts a charge, the provider succeeds, but the worker crashes before it can update the database? The payment stays stuck in processing forever. The user was charged, but my system thinks it's still "happening." This is an orphaned payment, and it’s a nightmare to debug because, from the outside, it looks like nothing went wrong.

The Processing Lease

I solved this by treating the processing status as a lease. When a worker starts, it stamps the payment with a processing_started_at timestamp. If a payment has been in processing for more than 2 minutes, I assume the worker died. I wrote a "sweeper" that finds these stale leases and resets them to failed so they can be picked up by the retry logic.…
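
Here is a minimal sketch of the lease and the sweeper described above. It is not my actual code: the payments table layout, the 'pending' status, and the function names start_processing and sweep_stale_leases are assumptions made for illustration, and SQLite stands in for whatever database you use. Only processing_started_at, the processing and failed statuses, and the 2-minute threshold come from the text.

```python
import sqlite3
import time

LEASE_SECONDS = 120  # the 2-minute threshold from the text


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical minimal table; the real schema is not shown in the post.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS payments (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            processing_started_at REAL
        )
        """
    )
    conn.commit()


def start_processing(conn: sqlite3.Connection, payment_id: str, now=None) -> bool:
    """Take the lease: mark the payment as processing and stamp the start time.

    Returns False if the payment is not in a claimable state, for example
    because another worker already holds the lease.
    """
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE payments "
        "SET status = 'processing', processing_started_at = ? "
        "WHERE id = ? AND status IN ('pending', 'failed')",  # 'pending' is assumed
        (now, payment_id),
    )
    conn.commit()
    return cur.rowcount == 1


def sweep_stale_leases(conn: sqlite3.Connection, now=None,
                       lease_seconds: float = LEASE_SECONDS) -> int:
    """Reset payments whose lease is older than lease_seconds to failed.

    Returns the number of payments that were reset.
    """
    now = time.time() if now is None else now
    cutoff = now - lease_seconds
    cur = conn.execute(
        "UPDATE payments "
        "SET status = 'failed', processing_started_at = NULL "
        "WHERE status = 'processing' AND processing_started_at < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Both functions do their work in a single UPDATE with the status check in the WHERE clause, so two workers (or a worker and the sweeper) cannot both win the same row. Because a swept payment may already have been charged by the provider, the retry presumably needs to reuse the original idempotency key so the second attempt does not charge the user twice.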