I Fixed My LLM OOM Crashes by Shrinking the Draft Model (Speculative Decoding on Real Hardware)


DEV Community·Nic Lydon·about 1 month ago


#ai #llm #machinelearning #draft #model #embedding


The fix was swapping a 4B draft model for a 0.6B one in my speculative decoding config. That's the whole punchline. But the path there touched every assumption I had about how spec decode interacts with VRAM budgets on consumer hardware, so here's the full story.

TL;DR

| Change | Result |
| --- | --- |
| 4B draft → 0.6B draft | ~2 GiB saved, same MoE throughput |
| Embedding parallelism 16 → 8 | ~8 GiB freed |
| Combined | Dropped from ~97 GiB to ~87.7 GiB, no more OOM |

Spec decode isn't free. You're paying VRAM for both models simultaneously.

The Setup

I run a local LLM inference gateway on two AMD-based mini PCs: GMKTec EVO-X2 boxes with Strix Halo APUs and 160 GB of unified memory each. The gateway serves around 20 models through llama-swap, a process manager that loads and evicts models on demand behind an OpenAI-compatible API. Think of it as a poor man's model router: one port per logical model, llama-swap starts the right llama.cpp process on request, and idle models get evicted when memory gets tight.…
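As a quick sanity check on the TL;DR figures, here is a minimal Python sketch of the memory arithmetic. The four input numbers are the approximate values quoted in the table. The function names are hypothetical, and the per-slot estimate assumes embedding memory scales linearly with the parallelism setting, which the article does not state.

```python
# Approximate figures quoted in the TL;DR table, all in GiB.
TOTAL_BEFORE_GIB = 97.0        # resident memory before the changes
TOTAL_AFTER_GIB = 87.7         # resident memory after both changes
DRAFT_SWAP_SAVING_GIB = 2.0    # 4B draft -> 0.6B draft
EMBEDDING_SAVING_GIB = 8.0     # embedding parallelism 16 -> 8


def measured_saving(before_gib: float, after_gib: float) -> float:
    """Memory reclaimed according to the before/after totals."""
    return before_gib - after_gib


def itemized_saving(*savings_gib: float) -> float:
    """Memory reclaimed according to the per-change estimates."""
    return sum(savings_gib)


def per_slot_cost(freed_gib: float, slots_before: int, slots_after: int) -> float:
    """Rough cost of one parallel slot.

    ASSUMPTION: memory scales linearly with the number of slots.
    """
    return freed_gib / (slots_before - slots_after)


measured = measured_saving(TOTAL_BEFORE_GIB, TOTAL_AFTER_GIB)
itemized = itemized_saving(DRAFT_SWAP_SAVING_GIB, EMBEDDING_SAVING_GIB)
gap = itemized - measured
slot_gib = per_slot_cost(EMBEDDING_SAVING_GIB, 16, 8)

print(f"measured saving: {measured:.1f} GiB")
print(f"itemized saving: {itemized:.1f} GiB")
print(f"gap between the two: {gap:.1f} GiB")
print(f"estimated cost per embedding slot: {slot_gib:.1f} GiB")
```

The two totals differ by less than 1 GiB, which is within what you would expect from rounded "~" figures. Under the linear-scaling assumption, the embedding change accounts for most of the reclaimed memory, with the draft model swap contributing the rest.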
