The Day Veltrix Blew Up at 100k Concurrent Users Because We Didn't Understand Its Garbage Collector


DEV Community: rust·pretty ncube·2 days ago


#dev #latency #live #heap #pause #load


It was 3:17 AM when the pager screamed. Our Rust-based treasure-hunt matchmaking service had been live for six weeks with steady load under 50k concurrent users, but overnight a new batch of streamers discovered the game. By 03:15 we were at 98k and climbing, and at 03:17 the heap spiked from 1.2 GB to 11 GB in 120 seconds. Prometheus graphs painted a vertical cliff: alloc rate 780 MB/s, pause times above 500 ms, match latency P99 jumping from 22 ms to 1.4 s. The logs repeated the same line every 400 ms: "GC cycle started (heap size 11.3 GB, live data 384 MB)". By 03:22 two regions had GC mark-termination timeouts, the runtime emitted "promise failed to resolve in time", and we dropped 28k concurrent users within two minutes. Not a crash, just a silent, creeping death by garbage collection.

We had started with Veltrix's official YAML configuration for the Tokio runtime: worker_threads: 8, max_blocking_threads: 512, keep_alive: 60s, capacity: 10000. That was the only tuning guide the docs provided.…
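To put the incident figures side by side, here is a minimal back-of-the-envelope sketch in plain Rust. It uses only the numbers quoted above; the function names are hypothetical, and it assumes 1 GB = 1000 MB, which the post does not specify. It compares the net heap growth rate with the reported allocation rate, and the reported live data with the reported heap size.

```rust
/// Net heap growth in MB/s, assuming 1 GB = 1000 MB (an assumption, not from the post).
fn net_growth_mb_per_s(start_gb: f64, end_gb: f64, seconds: f64) -> f64 {
    (end_gb - start_gb) * 1000.0 / seconds
}

/// Fraction of the heap that is live data (same 1 GB = 1000 MB assumption).
fn live_fraction(live_mb: f64, heap_gb: f64) -> f64 {
    live_mb / (heap_gb * 1000.0)
}

fn main() {
    // Figures quoted in the incident description.
    let alloc_rate_mb_s = 780.0;
    let growth = net_growth_mb_per_s(1.2, 11.0, 120.0);
    let live = live_fraction(384.0, 11.3);

    println!("net heap growth: {:.1} MB/s", growth);
    println!(
        "share of allocations not reclaimed in time: {:.1}%",
        growth / alloc_rate_mb_s * 100.0
    );
    println!("live data as share of heap: {:.1}%", live * 100.0);
}
```

Under these assumptions the heap grew at roughly a tenth of the allocation rate, while only a few percent of the heap was live at the time of the logged cycles. That is consistent with a collector that was reclaiming most garbage but still falling behind, rather than with a leak of live objects.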
