How to Serve Mistral Medium 3.5 128B Without Running Out of GPU Memory

DEV Community·Alan West·about 1 month ago

#llm #machinelearning #model #mistral #memory #128b

So you saw Mistral dropped their new open-weight 128B parameter model and thought "I should run this locally." You pulled the weights, fired up your inference server, and immediately got slapped with an OOM error. Yeah. Been there.

Serving large dense models is a different beast than the 7B or 13B models most of us cut our teeth on. Mistral Medium 3.5 128B is a fully dense 128 billion parameter model with a 256k token context window, vision capabilities, and native function calling. It's genuinely impressive on benchmarks, but none of that matters if you can't actually get it running. Let me walk through the problems you'll hit and how to solve each one.

The Root Cause: Dense Models Are Memory Hogs

Here's the fundamental math that ruins your day. A 128B parameter model in BF16 (which is how Mistral ships the weights) requires roughly 256 GB of GPU VRAM just for the model weights. That's before you account for KV cache, activation memory, or any batching overhead. A single H100 has 80 GB of VRAM.…
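The weight-memory arithmetic above can be checked with a few lines of Python. This is a rough sketch, not a sizing tool: it assumes 2 bytes per BF16 parameter, counts gigabytes as 10^9 bytes, and ignores KV cache, activations, and framework overhead, so the GPU count it prints is only a lower bound for holding the weights. The function names are made up for this illustration.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(weights_gb: float, gpu_vram_gb: float) -> int:
    """Smallest number of GPUs whose combined VRAM can hold the weights.

    Lower bound only: KV cache, activations, and batching overhead are ignored.
    """
    return math.ceil(weights_gb / gpu_vram_gb)


PARAMS = 128e9        # 128B parameters, fully dense
BF16_BYTES = 2        # BF16 stores each parameter in 16 bits
H100_VRAM_GB = 80     # VRAM on a single H100, as quoted in the article

weights_gb = weight_memory_gb(PARAMS, BF16_BYTES)
gpus = min_gpus_for_weights(weights_gb, H100_VRAM_GB)

print(f"Weights in BF16: {weights_gb:.0f} GB")
print(f"Minimum 80 GB GPUs for weights alone: {gpus}")
```

Under these assumptions the weights come to 256 GB, which already exceeds three H100s before a single token of KV cache is allocated.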
