The uncomfortable failure is not a carrier outage. That one is loud. The polling loop logs it, retries it, and moves on.

The failure I had to design around is quieter: the gateway receives a real carrier update, writes it to the database, then Redis is down at the exact moment the WebSocket fanout should happen. The client misses the live push. The shipment state is still correct. The event timeline is still correct. But the thing the user was watching in real time never moves.

That sounds like a broken real-time system until you decide which part is allowed to be temporary. I made the database the truth and the WebSocket stream the delivery layer. That one decision shaped the rest of the gateway.

The real boundary

DHL, DPD, and GLS do not send one clean stream of facts. The adapters have to handle different request shapes, different status codes, and different timestamp rules. DHL uses DHL-API-Key and returns shipment events under shipments[0].events.…
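
To make the split between truth and delivery concrete, here is a minimal sketch of the persist-then-push ordering described earlier. It is not the gateway's actual code: `InMemoryStore`, `DownPublisher`, `FanoutUnavailable`, and `handle_carrier_update` are hypothetical names, and the in-memory store only stands in for the database so the example runs on its own.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("gateway")


class FanoutUnavailable(Exception):
    """Hypothetical error raised when the fanout backend (e.g. Redis) is down."""


@dataclass
class InMemoryStore:
    """Stand-in for the database, which is treated as the source of truth."""
    events: dict = field(default_factory=dict)

    def append_event(self, shipment_id, event):
        self.events.setdefault(shipment_id, []).append(event)

    def timeline(self, shipment_id):
        return list(self.events.get(shipment_id, []))


class DownPublisher:
    """Simulates the fanout layer being unreachable at push time."""

    def publish(self, shipment_id, event):
        raise FanoutUnavailable("fanout backend unreachable")


def handle_carrier_update(store, publisher, shipment_id, event):
    """Persist first, then attempt the live push.

    Returns True if the push went out, False if it was skipped.
    A failed push never rolls back or blocks the durable write.
    """
    store.append_event(shipment_id, event)  # the truth is written first
    try:
        publisher.publish(shipment_id, event)  # delivery is best effort
        return True
    except FanoutUnavailable:
        logger.warning("live push skipped for %s; state is persisted", shipment_id)
        return False


store = InMemoryStore()
pushed = handle_carrier_update(
    store, DownPublisher(), "SHIP-1", {"status": "in_transit"}
)
# The push was lost, but a later read of the timeline still sees the event.
print(pushed, store.timeline("SHIP-1"))
```

The point of the ordering is that a client that missed the push can still recover the correct state by reading from the store, so the missed WebSocket message is a delivery gap rather than a data loss.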