Picking a Cloud TPU slice for vLLM inference involves three decisions that most tutorials skip over: how much HBM your model actually needs at runtime, how many chips to shard across, and whether the cost is justified for your workload. Undersize the slice and you OOM on startup; oversize it and you pay for memory you never use.

This post walks through how to make those decisions, with a reference table for popular models and a live interactive tool where you can select your model, toggle precision, and see exactly which TPU configurations fit and what they cost.

Try the interactive cheat sheet here: ggongg.github.io/vllm-tpu-notes

The site is a mini-project based on data pulled on 4/30/2026, so the numbers may have changed since then. Please double-check against the official docs and site.

What is vLLM and why run it on a TPU?

vLLM is an open-source LLM inference engine built for high-throughput, memory-efficient serving.…