A team I worked with once migrated an order-placement path from gRPC to NATS because "it's decoupled and faster." The old flow was simple: the web service called PlaceOrder via gRPC, got back an order ID, and rendered success to the user. In the new flow, the web service publishes order.place to NATS, and an order-service consumes it and processes it asynchronously.

Within three weeks they had three kinds of incidents on rotation:

- Duplicate orders: the publisher retried when an ack was slow, so the same order was placed twice even though the first publish had actually succeeded.
- Lost orders: the consumer crashed mid-process. With no ack, NATS redelivered, but the consumer had already partially committed state, so the redelivery was rejected by a dedup check. The order just... disappeared from the user's perspective.
- Dark-failure support tickets: users reported "I clicked buy and nothing happened." From the publisher side, everything looked fine.…
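
The lost-order incident is easiest to see in a toy model. The sketch below is a plain in-memory simulation, not NATS client code and not the team's actual consumer: every name in it (`NaiveConsumer`, `AtomicConsumer`, `deliver_at_least_once`, `msg_id`) is hypothetical. It assumes the broker redelivers a message exactly once after a crash, and it contrasts a consumer that records its dedup key before finishing the work with one that commits the key and the order in a single step.

```python
class CrashMidProcess(Exception):
    """Simulates the consumer dying partway through handling a message."""


class NaiveConsumer:
    """Hypothetical consumer: records the dedup key first, then does the work."""

    def __init__(self):
        self.seen = set()   # dedup keys already recorded
        self.orders = {}    # fully committed orders, keyed by message ID

    def handle(self, msg_id, order, crash=False):
        if msg_id in self.seen:
            return "rejected-duplicate"
        self.seen.add(msg_id)        # partial commit: the key is now durable...
        if crash:
            raise CrashMidProcess()  # ...but the order itself is never written
        self.orders[msg_id] = order
        return "placed"


class AtomicConsumer:
    """Hypothetical consumer: one write serves as both dedup record and order."""

    def __init__(self):
        self.orders = {}

    def handle(self, msg_id, order, crash=False):
        if msg_id in self.orders:
            return "already-placed"  # a redelivery or publisher retry is harmless
        if crash:
            raise CrashMidProcess()  # nothing was written, so a retry starts clean
        self.orders[msg_id] = order
        return "placed"


def deliver_at_least_once(consumer, msg_id, order, crash_first=True):
    """Deliver once; if the first attempt crashes before acking, redeliver once."""
    try:
        return consumer.handle(msg_id, order, crash=crash_first)
    except CrashMidProcess:
        return consumer.handle(msg_id, order, crash=False)


if __name__ == "__main__":
    order = {"sku": "widget", "qty": 1}
    naive, atomic = NaiveConsumer(), AtomicConsumer()
    print("naive: ", deliver_at_least_once(naive, "m1", order), naive.orders)
    print("atomic:", deliver_at_least_once(atomic, "m1", order), atomic.orders)
```

In the naive version the redelivery is turned away by the consumer's own dedup check and the order never exists anywhere. In the atomic version the same message ID also absorbs a publisher-side retry, which is the duplicate-order case. In a real system the "single write" would typically be a database transaction covering both records; that part is assumed here, not shown.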