Running Slurm in the cloud sounds simple at first: spin up some VMs, install Slurm, and start submitting jobs. In reality, cloud-based HPC introduces a different set of design decisions and trade-offs compared to on-prem clusters. If the architecture is not planned properly, costs increase quickly and performance can drop. This guide walks through a typical Slurm architecture on AWS/Azure and highlights the most common pitfalls.

Why Run Slurm in the Cloud?

Common reasons include:

- On-demand scaling for peak workloads
- No upfront hardware investment
- Access to GPU instances when needed
- Flexibility for short-term projects

However, cloud HPC is not always cheaper or faster; it depends heavily on how it is configured.

Typical Slurm Architecture in Cloud

A standard setup usually includes:

1. Head Node (Controller)

- Runs slurmctld
- Manages scheduling and job queues
- Typically a small-to-medium VM

Key Point: This node should be stable and always available.

2.…