been testing this properly for a few months because i kept seeing wildly different claims and couldn’t find real data anywhere. this is specifically for inference workloads on 70B-class models, and i care about p99 rather than p50, because p99 is what shows up in user complaints.

the thing nobody explains clearly: cold start has two components. the first is model loading time, which is roughly fixed by model size and doesn’t vary much across platforms. the second is infrastructure queue time, which is where all the variance actually lives. most platform benchmarks conflate the two and publish a single number that looks great but doesn’t reflect what happens when their infrastructure is under load.

what i actually found testing across platforms: the platforms running on single-provider infrastructure have p99 cold start that degrades meaningfully when that provider is at high utilization. you’re waiting in their queue, and when the queue is long, p99 spikes.…
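
a rough sketch of how the two-component split above can be kept separate when you summarize results, instead of reporting one blended number. everything here is an assumption for illustration: the `percentile` and `summarize` helpers are hypothetical names, the nearest-rank percentile method is just one common choice, and the sample values are made up, not measurements from any platform.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """samples: list of (queue_seconds, load_seconds) tuples, one per cold start."""
    queue = [q for q, _ in samples]
    load = [l for _, l in samples]
    total = [q + l for q, l in samples]
    return {
        name: {"p50": percentile(vals, 50), "p99": percentile(vals, 99)}
        for name, vals in (("queue", queue), ("load", load), ("total", total))
    }


# synthetic data (assumption, not real measurements):
# 95 cold starts with a short queue, 5 that hit a long queue
samples = [(2.0, 60.0)] * 95 + [(90.0, 61.0)] * 5
stats = summarize(samples)

for name, s in stats.items():
    print(f"{name:>5}: p50={s['p50']:.1f}s  p99={s['p99']:.1f}s")
```

with data shaped like this, load time barely moves between p50 and p99 while queue time accounts for nearly all of the tail, which is the pattern a single blended cold start number hides.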