DynoSim: Simulating the Pareto Frontier

NVIDIA Technical Blog·Yongming Ding·2 days ago

Modern LLM serving is hard to tune because each deployment is a stack of interacting choices: model backend, tensor-parallel shape, prefill/decode split, worker counts, scheduler settings, routing policy, KV cache behavior, autoscaling thresholds, and topology. Those choices interact across layers, so a local improvement can shift the bottleneck somewhere else. For larger models, even one realistic experiment can require many GPUs or nodes before we learn whether the idea was worth testing.

That is the motivation for DynoSim: a Dynamo twin. DynoSim is a workload-driven discrete-event simulation of the NVIDIA Dynamo serving stack. It combines measured engine forward-pass timing, Mocker scheduler cores, Router and Planner behavior, KV cache effects, and workload traces on one virtual timeline. The goal is neither a purely analytical estimate nor a bit-exact hardware emulator.…
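To make the "one virtual timeline" idea concrete, here is a minimal sketch of a generic discrete-event loop in Python. It is not DynoSim code and does not use any Dynamo API: the `Event` type, the `simulate` function, and the toy FIFO queue are all hypothetical names chosen for illustration. The only assumption borrowed from the text is that each request carries a service time standing in for a measured forward-pass duration.

```python
import heapq
from collections import deque
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(order=True)
class Event:
    """One entry on the virtual timeline (hypothetical type, for illustration)."""
    time: float
    seq: int  # tie-breaker so events at the same time pop in insertion order
    kind: str = field(compare=False)
    request_id: int = field(compare=False)


def simulate(arrivals: List[Tuple[float, float]], num_workers: int) -> List[float]:
    """Return per-request latency for a toy FIFO queue served by identical workers.

    arrivals: (arrival_time, service_time) pairs. The service_time is an
    assumed stand-in for a measured forward-pass duration.
    """
    events: List[Event] = []
    seq = 0
    for rid, (arrival_time, _) in enumerate(arrivals):
        heapq.heappush(events, Event(arrival_time, seq, "arrive", rid))
        seq += 1

    free_workers = num_workers
    waiting = deque()  # request ids waiting for a free worker
    latency = [0.0] * len(arrivals)

    while events:
        ev = heapq.heappop(events)  # always advance to the earliest pending event
        now = ev.time
        if ev.kind == "arrive":
            waiting.append(ev.request_id)
        else:  # "finish": the worker becomes free and latency is recorded
            free_workers += 1
            latency[ev.request_id] = now - arrivals[ev.request_id][0]

        # Dispatch waiting requests to any free workers at the current virtual time.
        while free_workers > 0 and waiting:
            rid = waiting.popleft()
            free_workers -= 1
            heapq.heappush(events, Event(now + arrivals[rid][1], seq, "finish", rid))
            seq += 1

    return latency


if __name__ == "__main__":
    trace = [(0.0, 2.0), (0.5, 1.0), (1.0, 1.0)]
    print(simulate(trace, num_workers=1))
    print(simulate(trace, num_workers=2))
```

Running the same trace with one worker and then two shows the kind of question a simulator can answer without hardware: how queueing delay changes when a single deployment knob moves. A real serving simulator would replace the fixed service times with measured timing data and add scheduler, routing, and cache models as further event sources on the same timeline.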
