I’ve been experimenting with an inference stack for a side project and wanted to share something that surprised me. The problem: cold starts were killing my UX. Users hitting a chat endpoint would occasionally wait 3-5 seconds because the model was serving from a cold container. Here’s what I did: Session-aware routing Instead of round-robin to any available node, I pinned sessions to warm instances for a sliding TTL window. If a user returns within 60 seconds, they hit the same GPU node. lightweight pre-fetch I added a health-check route that primes the KV cache by sending a dummy token before the actual request. This keeps the model hot without wasting real compute. Model choice mattered more than I expected I tested several providers and models. The biggest latency wins came from the model architecture itself. For my workload (multi-turn reasoning with long context), DeepSeek-V4-Pro cut decoding time noticeably compared to what I was using before.…