Which serverless GPU platforms actually have fast cold starts for AI inference — p99, not p50

DEV Community·yukixing6-star·22 days ago

#gpu #machinelearning #infrastructure #devops #provider #queue

been testing this properly for a few months because i kept seeing wildly different claims and couldn't find real data anywhere. this is specifically for inference workloads on 70B-class models, and i care about p99 rather than p50 because p99 is what shows up in user complaints, not the median.

the thing nobody explains clearly: cold start has two components. the first is model loading time, which is roughly fixed by model size and doesn't vary much across platforms. the second is infrastructure queue time, which is where all the variance actually lives. most platform benchmarks conflate the two and publish a single number that looks great but doesn't reflect what happens when their infrastructure is under load.

what i actually found testing across platforms: the platforms running on single-provider infrastructure have a p99 cold start that degrades meaningfully when that provider is at high utilization. you're waiting in their queue, and when the queue is long, p99 spikes.…
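to make the queue-vs-load split concrete, here is a minimal sketch of how the two components could be reported separately. it assumes you can timestamp each cold request at submission, at GPU assignment, and at first response, so that every sample is a `(queue_seconds, load_seconds)` pair. the function names and the sample numbers are made up for illustration; they are not measurements from any platform.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(cold_starts):
    """cold_starts: list of (queue_seconds, load_seconds) pairs, one per cold request."""
    queue = [q for q, _ in cold_starts]
    load = [l for _, l in cold_starts]
    total = [q + l for q, l in cold_starts]
    return {
        name: {"p50": percentile(vals, 50), "p99": percentile(vals, 99)}
        for name, vals in (("queue", queue), ("load", load), ("total", total))
    }

# Hypothetical data: 98 requests get a GPU quickly, 2 sit in a long queue.
samples = [(2.0, 40.0)] * 98 + [(90.0, 41.0)] * 2
report = summarize(samples)

for name, stats in report.items():
    print(f"{name:>5}: p50={stats['p50']:.1f}s  p99={stats['p99']:.1f}s")
```

with these made-up numbers the load component barely moves between p50 and p99, while the queue component (and therefore the total) jumps, which is the pattern a single blended cold-start number would hide.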
