Running AI Workloads on Rack-Scale Supercomputers: From Hardware to Topology-Aware Scheduling

NVIDIA Technical Blog·Ryan Prout·about 1 month ago

The NVIDIA GB200 NVL72 and NVIDIA GB300 NVL72 systems, featuring NVIDIA Blackwell architecture, are rack-scale supercomputers. They’re designed with 18 tightly coupled compute trays, massive GPU fabrics, and high-bandwidth networking packaged as a unit.

For AI architects and HPC platform operators, the challenge isn’t just racking and stacking hardware; it’s turning infrastructure into safe, performant, and easy-to-use resources for end users. The mismatch between rack-scale hardware topology and scheduler abstractions is where most of the operational complexity lives. Left unaddressed, schedulers operate on a flat pool of GPUs and nodes, overlooking the system’s hierarchical and topology-sensitive design. This is the gap that a validated software stack, such as NVIDIA Mission Control, is designed to bridge. Mission Control provides rack-scale control planes for NVIDIA Grace Blackwell NVL72 systems.…
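
To make the "flat pool" problem concrete, here is a minimal Python sketch that contrasts a topology-blind picker with one that keeps a job inside a single rack. It is an illustration under stated assumptions, not how Mission Control or any particular scheduler implements placement: the `Node` type, the `rack` label, and both function names are hypothetical, and each rack is modeled simply as 18 compute trays.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str  # hypothetical label for the rack a compute tray belongs to


def flat_pick(free_nodes, count):
    """Topology-blind placement: take the first `count` free nodes."""
    if len(free_nodes) < count:
        return None
    return list(free_nodes[:count])


def rack_aware_pick(free_nodes, count):
    """Place the whole job in one rack, choosing the tightest-fitting rack."""
    by_rack = defaultdict(list)
    for node in free_nodes:
        by_rack[node.rack].append(node)
    # Only racks with enough free trays for the entire job are candidates.
    candidates = [nodes for nodes in by_rack.values() if len(nodes) >= count]
    if not candidates:
        return None  # caller decides whether to queue or span racks
    best = min(candidates, key=len)  # tightest fit limits fragmentation
    return best[:count]


# Assumed scenario: rack-a has 2 of its 18 trays free, rack-b has all 18 free.
free = [Node(f"a{i:02d}", "rack-a") for i in range(16, 18)]
free += [Node(f"b{i:02d}", "rack-b") for i in range(18)]

flat = flat_pick(free, 4)
aware = rack_aware_pick(free, 4)
print(sorted({n.rack for n in flat}))   # the job straddles both racks
print(sorted({n.rack for n in aware}))  # the job stays inside rack-b
```

In this toy scenario the flat picker splits a four-node job across two racks, while the rack-aware picker keeps it within one. A production scheduler would also weigh GPU counts per tray, partial-rack fallback policies, and fabric health, none of which are modeled here.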
