A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.

1 / 4

A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.

DEV Community·Ingero Team·28 days ago

#ADGtJR43

#machinelearning #devops #echo #events #fullscreen #rank

Reading 0:00

15s threshold

Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms late. The cause lives in a cross-rank query, not in any single host’s trace. Watch the demo (90 seconds): Echo starts on :4317, two echo-stress invocations push 2,000 events combined (node-a100 + node-h100), then DuckDB verifies they all landed and the causal-chain markers survived the wire. TL;DR Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization. Every per-host eBPF trace looks clean. The cause is rank 5 on node B taking 380ms on a step that the other seven ranks finish in 90ms. The other seven ranks spend 290ms blocked inside ncclAllReduce , which counts as a running kernel and reports as healthy on every per-host metric. The diagnosis lives in a cross-rank query, not in any single host’s trace.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Create free account Log in

Menu

A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.