The Problem We Were Actually Solving

On the surface, it seemed like we were just building a complex event-driven system to handle treasure hunt requests. In reality, we were solving a much deeper problem: building a highly scalable, responsive matchmaking engine that could handle thousands of simultaneous users. We wanted users to interact with the treasure hunt system seamlessly, without noticing any delays or errors.

What We Tried First (And Why It Failed)

When we first started building the treasure hunt engine, we went with a classic pub/sub architecture, using Apache Kafka as our event bus. We set up a series of ZooKeeper instances to manage our Kafka clusters, and our application code simply published events to topics and subscribed to those topics to process them. Sounds simple enough, right? But we failed to consider the exponential scaling cost of managing a large number of topics and ZooKeeper instances.…
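
To make the publish/subscribe flow described above concrete, here is a minimal sketch of topic-based delivery. It is an in-process stand-in, not Kafka and not our production code: the `InMemoryEventBus` class, the `hunt.requests` topic name, and the event fields are all hypothetical and exist only for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A handler receives one event (a plain dict in this sketch).
Handler = Callable[[dict], None]


class InMemoryEventBus:
    """Toy topic-based bus; a stand-in for a real broker such as Kafka."""

    def __init__(self) -> None:
        # Maps each topic name to the handlers subscribed to it.
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        """Register a handler to be called for every event on the topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = InMemoryEventBus()
received: List[dict] = []

# Hypothetical topic and payload, chosen only for this example.
bus.subscribe("hunt.requests", received.append)
delivered = bus.publish("hunt.requests", {"user_id": 42, "action": "join"})
```

Publishers and subscribers only share a topic name, which is what makes the pattern feel simple at first. A real broker adds persistence, partitioning, and cluster coordination on top of this, and that is where the per-topic operational overhead comes from.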