nvidia-smi reads 97% the entire window. The red gaps in the cause-side timeline are the throughput the GPU lost while the counter sat green.

TL;DR

A vLLM server reads 97% GPU utilization on nvidia-smi for an 8-minute window. Token throughput drops 3x in the middle of that window. Both statements are true, and both come from the same workload. The reason is that GPU utilization as nvidia-smi reports it is a duty-cycle counter (percent of time at least one kernel was running), not a measure of useful work. Five different failure modes score 100% on that counter while throughput collapses. Causal observability lives in the layer below: kernel runtime distributions, off-CPU time on the dispatcher thread, NCCL waits, I/O stalls.

The mystery

We were running an internal repro of a vLLM latency spike on a TensorDock RTX 4090 (vLLM 0.18.0, Qwen2.5-0.5B-Instruct).…
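
To make the duty-cycle point from the TL;DR concrete, here is a minimal Python sketch. It is a simplified model, not how nvidia-smi is implemented: the kernel timings are invented, `busy_fraction` and `make_timeline` are hypothetical helpers, and "one token per kernel" is an assumption made purely for illustration. It shows two timelines that score the same on a "was any kernel running?" counter while differing 3x in work completed.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s) of one kernel execution


def busy_fraction(kernels: List[Interval], window_s: float) -> float:
    """Fraction of the window during which at least one kernel was running."""
    busy = 0.0
    cur_start = cur_end = None
    for start, end in sorted(kernels):
        # Clip each kernel to the observation window.
        start, end = max(0.0, start), min(window_s, end)
        if end <= start:
            continue
        if cur_end is None or start > cur_end:
            # Gap before this kernel: close out the previous busy span.
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping kernels count once, as a duty-cycle counter would.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        busy += cur_end - cur_start
    return busy / window_s


def make_timeline(count: int, period_s: float, runtime_s: float) -> List[Interval]:
    """Back-to-back kernels: one launch every period_s, each running runtime_s."""
    return [(i * period_s, i * period_s + runtime_s) for i in range(count)]


WINDOW_S = 3.0

# Hypothetical numbers: the degraded kernels are 3x slower, so 3x fewer fit.
healthy = make_timeline(count=300, period_s=0.010, runtime_s=0.0097)
degraded = make_timeline(count=100, period_s=0.030, runtime_s=0.0291)

healthy_util = busy_fraction(healthy, WINDOW_S)
degraded_util = busy_fraction(degraded, WINDOW_S)

# Assumption for illustration only: each kernel completes one decode step.
healthy_tokens = len(healthy)
degraded_tokens = len(degraded)

print(f"healthy:  util={healthy_util:.0%}  tokens/s={healthy_tokens / WINDOW_S:.1f}")
print(f"degraded: util={degraded_util:.0%}  tokens/s={degraded_tokens / WINDOW_S:.1f}")
```

Both timelines keep a kernel resident for the same share of the window, so the counter cannot tell them apart. The difference only shows up once you look at per-kernel runtimes, which is the layer the rest of this post is about.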