Running Large-Scale GPU Workloads on Kubernetes with Slurm

NVIDIA Technical Blog·Anton Polyakov·about 1 month ago

Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Most organizations running large-scale AI training have years of investment in Slurm job scripts, fair-share policies, and accounting workflows. The challenge is getting Slurm scheduling capabilities onto Kubernetes, the standard platform for managing GPU infrastructure at scale, without maintaining two separate environments.

Slinky, an open source project developed by SchedMD (now part of NVIDIA), takes two approaches to this integration:

- slurm-bridge brings Slurm scheduling to native Kubernetes workloads, allowing Slurm to act as a Kubernetes scheduler for pods
- slurm-operator runs full Slurm clusters on Kubernetes infrastructure, managing the complete lifecycle of Slurm daemons as pods

This post focuses on the slurm-operator, which is how NVIDIA runs Slurm on Kubernetes for large-scale GPU training clusters.…
