Achieving Peak System and Workload Efficiency on NVIDIA GB200 NVL72 with Slurm Block Scheduling

NVIDIA Technical Blog·Felix Abecassis·25 days ago

NVIDIA GB200 NVL72 introduces a fundamentally new way to build GPU clusters by extending NVIDIA NVLink coherence across an entire rack. This design enables exascale performance, but it also changes the assumptions that many scheduling systems were built on: “rack-scale locality” becomes a hard constraint. When a workload crosses NVLink domain boundaries, performance drops sharply, and a scheduler that treats the network fabric as a best-effort tree topology will fragment allocations in ways that increase queue times and degrade application performance. To address this, the Slurm workload manager introduced the topology/block plugin and continues to expand its capabilities with segmented scheduling. The plugin enables administrators and users to express application-specific NVLink requirements as atomic blocks rather than loosely optimized allocations.…
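
To make the fragmentation problem concrete, here is a minimal Python sketch that counts how many rack-level NVLink domains an allocation touches and checks whether it can be split into segments that each stay inside one domain. It is an illustration only, not Slurm's scheduling algorithm: the function names are hypothetical, nodes are assumed to be numbered consecutively rack by rack, and each block is assumed to hold 18 compute nodes (one NVL72 rack of 72 GPUs at 4 GPUs per node).

```python
from collections import Counter

# Assumption: one block = one GB200 NVL72 rack = 18 compute nodes (4 GPUs each).
NODES_PER_BLOCK = 18


def block_of(node_index, block_size=NODES_PER_BLOCK):
    """Return the block (NVLink domain) a node belongs to, assuming
    nodes are numbered consecutively rack by rack."""
    return node_index // block_size


def blocks_spanned(node_indices, block_size=NODES_PER_BLOCK):
    """Map each touched block to the number of allocated nodes inside it."""
    return Counter(block_of(n, block_size) for n in node_indices)


def is_block_aligned(node_indices, segment_size, block_size=NODES_PER_BLOCK):
    """True if the allocation can be partitioned into segments of
    `segment_size` nodes such that no segment crosses a block boundary."""
    counts = blocks_spanned(node_indices, block_size)
    return all(count % segment_size == 0 for count in counts.values())


# An 8-node job placed without regard to rack boundaries (nodes 14..21)
# straddles two NVLink domains; the same job placed on nodes 18..25 does not.
fragmented = list(range(14, 22))
packed = list(range(18, 26))

print(dict(blocks_spanned(fragmented)))  # nodes per block for the fragmented job
print(is_block_aligned(fragmented, segment_size=8))
print(is_block_aligned(packed, segment_size=8))
```

The fragmented placement still passes the check with a segment size of 4, because each rack holds exactly four of its nodes. This shows why the segment size has to come from the application's NVLink requirement rather than from the scheduler's convenience.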
