The fix was swapping a 4B draft model for a 0.6B one in my speculative decoding config. That's the whole punchline. But the path there touched every assumption I had about how spec decode interacts with VRAM budgets on consumer hardware, so here's the full story.

## TL;DR

| Change | Result |
| --- | --- |
| 4B draft → 0.6B draft | ~2 GiB saved, same MoE throughput |
| Embedding parallelism 16 → 8 | ~8 GiB freed |
| Combined | Dropped from ~97 GiB to ~87.7 GiB, no more OOM |

Spec decode isn't free. You're paying VRAM for both models simultaneously.

## The Setup

I run a local LLM inference gateway on two AMD-based mini PCs: GMKTec EVO-X2 boxes with Strix Halo APUs and 160 GB of unified memory each. The gateway serves around 20 models through llama-swap, a process manager that loads and evicts models on demand behind an OpenAI-compatible API. Think of it as a poor man's model router: one port per logical model, llama-swap starts the right llama.cpp process on request, and idle models get evicted when memory gets tight.…
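To make the "both models are resident at once" point concrete, here is a minimal toy sketch of a memory-budgeted model pool that evicts the least recently used model when a new one would not fit. It is not how llama-swap is implemented. `ModelSpec`, `ToyModelPool`, and every size in it are hypothetical names and made-up numbers chosen only for illustration, and it assumes Python 3.9 or newer.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """Hypothetical description of one logical model (sizes in GiB)."""
    name: str
    target_gib: float
    draft_gib: float = 0.0  # 0 when speculative decoding is off

    @property
    def footprint_gib(self) -> float:
        # With spec decode, target and draft are resident at the same time.
        return self.target_gib + self.draft_gib


class ToyModelPool:
    """Toy pool: evicts least recently used models to stay under a budget."""

    def __init__(self, budget_gib: float) -> None:
        self.budget_gib = budget_gib
        # Insertion order doubles as recency order: oldest entry first.
        self.loaded: "OrderedDict[str, ModelSpec]" = OrderedDict()

    def used_gib(self) -> float:
        return sum(m.footprint_gib for m in self.loaded.values())

    def request(self, spec: ModelSpec) -> list[str]:
        """Load `spec` if needed and return the names evicted to make room."""
        if spec.footprint_gib > self.budget_gib:
            raise MemoryError(f"{spec.name} cannot fit in {self.budget_gib} GiB")
        evicted: list[str] = []
        if spec.name in self.loaded:
            # Already resident: just mark it as most recently used.
            self.loaded.move_to_end(spec.name)
            return evicted
        while self.used_gib() + spec.footprint_gib > self.budget_gib:
            name, _ = self.loaded.popitem(last=False)  # drop the oldest
            evicted.append(name)
        self.loaded[spec.name] = spec
        return evicted


if __name__ == "__main__":
    # Made-up sizes: a bigger draft model tips the pool over its budget.
    pool = ToyModelPool(budget_gib=100)
    pool.request(ModelSpec("embed", target_gib=30))
    print(pool.request(ModelSpec("moe", target_gib=68, draft_gib=4)))
```

In this toy setup, the 4 GiB draft pushes the total to 102 GiB and forces the embedding model out, while a 2 GiB draft would fit exactly at 100 GiB with no eviction. The real numbers on my boxes are different, but the shape of the problem is the same: the draft model counts against the same budget as everything else.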