TL;DR: We migrated 10+ microservices from direct HTTP calls to Kafka event-driven communication. Reliability improved massively, but the migration was harder than expected. Here are the real lessons, including the mistakes.

Our system started as a monolith. Then we split it into microservices. The services talked to each other using direct HTTP calls: Service A would POST to Service B, which would POST to Service C. It worked fine when we had 3 services. Then we had 10.

The Day Everything Cascaded

One Tuesday morning our notification service crashed because of a memory leak. No big deal, right? Restart it and move on.

But the order service was calling the notification service directly during checkout. When the notification service was down, the order endpoint started timing out. Users could not place orders. The billing service was also calling the notification service to confirm payment receipts. Billing started failing too.…
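
To make the cascade concrete, here is a minimal sketch of the two styles. It is not our production code: the function names are hypothetical, the `healthy` flag simulates the crashed notification service, and Python's in-memory `queue.Queue` only stands in for a Kafka topic (no real Kafka client is used).

```python
import queue


class NotificationDown(Exception):
    """Raised by the stand-in notification service while it is crashed."""


def send_notification(message: str, healthy: bool) -> str:
    # Hypothetical stand-in for an HTTP POST to the notification service.
    if not healthy:
        raise NotificationDown("notification service unavailable")
    return f"sent: {message}"


def place_order_sync(order_id: str, notification_healthy: bool) -> str:
    # Direct-call style: checkout fails whenever the downstream call fails.
    send_notification(f"order {order_id} placed", notification_healthy)
    return "order placed"


def place_order_async(order_id: str, events: queue.Queue) -> str:
    # Event-driven style: record what happened and return immediately.
    # The queue is an assumption standing in for a Kafka topic.
    events.put({"type": "order_placed", "order_id": order_id})
    return "order placed"


def drain_notifications(events: queue.Queue, healthy: bool) -> list:
    # Consumer side: process pending events, keeping them if the service is down.
    sent = []
    while not events.empty():
        event = events.get()
        try:
            sent.append(
                send_notification(f"order {event['order_id']} placed", healthy)
            )
        except NotificationDown:
            events.put(event)  # put the event back for a later retry
            break
    return sent
```

In the direct-call version the order fails as soon as the notification call raises. In the queued version the order succeeds, and the notification simply waits until the consumer can deliver it.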