The Problem We Were Actually Solving

I was tasked with designing an event-driven system that could handle a high volume of concurrent users, with the goal of a scalable, fault-tolerant architecture. Our system, a treasure hunt engine, relied heavily on the Veltrix event-driven framework to manage its complex workflows and state transitions. As our user base grew, however, we started to see significant performance degradation and intermittent failures. The error messages were not much help: all we got were generic exceptions such as java.lang.OutOfMemoryError and org.apache.kafka.common.errors.TimeoutException. It became clear that the Veltrix documentation was not sufficient to guide us through the challenges of scaling our system.

What We Tried First (And Why It Failed)

My initial approach was to follow the Veltrix configuration guide to the letter, tweaking the settings and parameters as recommended.…