TL;DR: A GPU isn't always the right call for inference. At Leaseweb, we benchmarked a dual-socket EPYC 9334 on 7B–20B LLMs and three TTS models. Here's what the numbers actually look like, and when CPU inference makes sense.

Why inference is where your budget actually disappears

Training is a one-time cost. Inference is not. Once a model is in production, it runs continuously, and its total cost scales directly with traffic. For many teams, inference spend overtakes training spend within months of launch.

The hardware decision for inference is also different from the one for training. Training wants large GPU clusters with high-bandwidth interconnects. Inference wants low latency, high throughput per dollar, and enough memory bandwidth to serve quantised weights efficiently. Those requirements don't always point to a GPU.

The two metrics that actually matter for LLM inference

When a prompt hits an LLM, two stages happen:

Prefill: the model converts input tokens, runs them through its layers, and builds a KV cache.…