vLLM on Google Cloud TPU: A Model Size vs Chip Cheat Sheet (With Interactive Tool)

DEV Community·Grace Gong·about 1 month ago

#tpu #tpusprint #googlecloud #vllm #model #chip

Picking a Cloud TPU slice for vLLM inference involves three decisions that most tutorials skip over: how much HBM your model actually needs at runtime, how many chips to shard across, and whether the cost is justified for your workload. Get it wrong in either direction and you're either OOMing on startup or paying for memory you're not using.

This post walks through how to make that decision, with a reference table for popular models and a live interactive tool where you can select your model, toggle precision, and see exactly which TPU configurations fit and what they cost.

Try the interactive cheat sheet here: ggongg.github.io/vllm-tpu-notes

The site is a mini-project based on data pulled on 4/30/2026, and that data may change. Refer to the official docs and site to double-check.

What is vLLM and why run it on a TPU?

vLLM is an open-source LLM inference engine built for high-throughput, memory-efficient serving.…
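To make the first two decisions from the introduction concrete (how much HBM the model needs, and how many chips to shard across), here is a rough back-of-the-envelope sketch. It is not the method used by the interactive tool. The function names, the flat `overhead_fraction` standing in for KV cache, activations, and runtime buffers, and the power-of-two rounding of chip counts are all assumptions made for illustration. Per-chip HBM is passed in as a parameter, so take the real figure from the official Cloud TPU docs.

```python
import math


def estimate_hbm_gib(params_billion, bytes_per_param, overhead_fraction=0.3):
    """Rough runtime HBM estimate in GiB.

    overhead_fraction is a hypothetical flat allowance for KV cache,
    activations, and runtime buffers. Real overhead depends on context
    length, batch size, and engine settings.
    """
    weight_bytes = params_billion * 1e9 * bytes_per_param
    total_bytes = weight_bytes * (1 + overhead_fraction)
    return total_bytes / 2**30


def min_chips(required_gib, hbm_per_chip_gib):
    """Smallest chip count whose combined HBM covers required_gib.

    Assumption: chip counts are rounded up to a power of two, which is
    a common convenience for tensor-parallel sharding. Check which slice
    shapes are actually offered before relying on this.
    """
    chips = max(1, math.ceil(required_gib / hbm_per_chip_gib))
    return 1 << (chips - 1).bit_length()


if __name__ == "__main__":
    # Hypothetical example: a 70B-parameter model at 2 bytes per parameter
    # (bf16), on a chip assumed to have 16 GiB of HBM.
    need = estimate_hbm_gib(params_billion=70, bytes_per_param=2)
    print(f"estimated HBM: {need:.1f} GiB")
    print(f"chips needed:  {min_chips(need, hbm_per_chip_gib=16)}")
```

Toggling precision, as the interactive tool lets you do, corresponds to changing `bytes_per_param` here (for example 2 for bf16 or 1 for an 8-bit format). Treat the result as a first filter only: a configuration that passes this estimate can still OOM if the KV cache for your context length and batch size exceeds the assumed overhead.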
