Menu

Post image 1
Post image 2
Post image 3
Post image 4
1 / 4
0

A Cluster Stall Looks Healthy on Every Host. The Cause Is in the Pattern Across Hosts.

DEV Community·Ingero Team·28 days ago
#ADGtJR43
Reading 0:00
15s threshold

Eight ranks on two hosts. Every per-host metric reads healthy. Rank 5 enters the barrier 290ms late. The cause lives in a cross-rank query, not in any single host’s trace. Watch the demo (90 seconds): Echo starts on :4317, two echo-stress invocations push 2,000 events combined (node-a100 + node-h100), then DuckDB verifies they all landed and the causal-chain markers survived the wire. TL;DR Eight ranks on two hosts run an all-reduce. Token throughput drops 4x. Every per-host nvidia-smi reads 95-99% utilization. Every per-host eBPF trace looks clean. The cause is rank 5 on node B taking 380ms on a step that the other seven ranks finish in 90ms. The other seven ranks spend 290ms blocked inside ncclAllReduce , which counts as a running kernel and reports as healthy on every per-host metric. The diagnosis lives in a cross-rank query, not in any single host’s trace.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More