Fine‑tuning a 10B‑parameter model on a single RTX 4090 feels like watching paint dry: most of the GPU sits idle while a handful of layers chew through memory, and the whole job slows to a crawl. The bottleneck isn’t raw FLOPs; it’s the rigid coupling between model weights and the device memory you allocate for them. Pipeline parallelism was supposed to solve that, but conventional schedules bind each model stage to a fixed GPU. When a heavyweight head sits on one card, that card becomes the choke point, and bubbles waste up to 30% of the pipeline’s capacity [1]. The cache that powers autoregressive generation suffers a similar fate: each layer hoards its own key‑value memory, ballooning the footprint and throttling batch size. RoundPipe breaks the binding entirely. “RoundPipe treats GPUs as a pool of stateless execution workers and dynamically dispatches computation stages across devices in a round‑robin manner, achieving a near‑zero‑bubble pipeline” [1]. …
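
To make the contrast between fixed binding and round‑robin dispatch concrete, here is a toy sketch. It is not RoundPipe's scheduler: the stage costs, the function names, and the rotation rule `(m + s) % n_gpus` are all assumptions made for illustration, and the model ignores stage dependencies and the cost of moving weights between devices. It only adds up how much work lands on each GPU under the two assignment rules.

```python
from collections import defaultdict

def per_gpu_load(stage_costs, n_microbatches, n_gpus, assign):
    """Sum the compute cost that each GPU receives under an assignment rule."""
    load = defaultdict(float)
    for m in range(n_microbatches):
        for s, cost in enumerate(stage_costs):
            load[assign(m, s, n_gpus)] += cost
    return [load[g] for g in range(n_gpus)]

def fixed_binding(m, s, n_gpus):
    # Conventional pipeline: stage s always runs on the same GPU.
    return s % n_gpus

def round_robin(m, s, n_gpus):
    # Hypothetical rotation: shift the starting GPU by one per micro-batch,
    # so every GPU eventually runs every stage.
    return (m + s) % n_gpus

# Assumed relative costs: three light stages and one heavyweight head.
stage_costs = [1.0, 1.0, 1.0, 5.0]

fixed = per_gpu_load(stage_costs, n_microbatches=4, n_gpus=4, assign=fixed_binding)
rotated = per_gpu_load(stage_costs, n_microbatches=4, n_gpus=4, assign=round_robin)

print("fixed binding:", fixed)    # [4.0, 4.0, 4.0, 20.0]
print("round-robin:  ", rotated)  # [8.0, 8.0, 8.0, 8.0]
```

Under fixed binding the GPU that owns the heavy head carries five times the work of the others, so the rest of the pipeline waits on it. With the rotation, the same total work is spread evenly. A real scheduler would also have to respect the order in which stages run and pay for getting the right weights onto each worker, which this sketch leaves out.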