I spent a day deploying vLLM on GKE with TPU v5e. Here's the full guide - quota, capacity, Gemma 4 testing, and autoscaling

I recently went through the process of setting up autoscaling LLM inference on GKE using Cloud TPU v5e and vLLM. The experience was educational enough that I wrote a detailed guide covering everything I encountered.

What the guide covers:

- How TPU quota actually works on GCP (there are three independent gates, and one of them is called GPUS_ALL_REGIONS - which blocks TPUs despite the name)
- Scanning zones for capacity and the right strategy when everything is exhausted
- The correct GKE syntax for TPU node pools (--machine-type, not --accelerator)
- Testing Gemma 4 (E2B, E4B, 26B-A4B) on vLLM's TPU backend - none work today due to a shared layers limitation
- Full HPA autoscaling setup using Managed Prometheus and vLLM's num_requests_waiting metric

What I deployed: Gemma 3 4B on a single TPU v5e chip with the complete autoscaling stack proven and working.…
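
For intuition on how an HPA driven by a queue-depth metric like num_requests_waiting picks a replica count, here is a minimal Python sketch of the scaling rule documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current * metric / target), with no change when the ratio is within a tolerance of 1. The function name, the target of 5 waiting requests per pod, and the replica bounds are illustrative assumptions, not values taken from the guide.

```python
import math


def desired_replicas(current_replicas, avg_waiting_per_pod, target_waiting_per_pod,
                     min_replicas=1, max_replicas=4, tolerance=0.1):
    """Approximate the core HPA rule for a per-pod average metric.

    All defaults here are assumptions for illustration; 0.1 matches the
    default tolerance described in the Kubernetes HPA documentation.
    """
    if current_replicas <= 0:
        raise ValueError("current_replicas must be positive")
    if target_waiting_per_pod <= 0:
        raise ValueError("target_waiting_per_pod must be positive")

    ratio = avg_waiting_per_pod / target_waiting_per_pod

    # Within tolerance of the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds (minReplicas / maxReplicas in an HPA spec).
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # One pod with 12 requests waiting against an assumed target of 5 per pod.
    print(desired_replicas(1, 12.0, 5.0))
```

The real controller also applies stabilization windows and scaling policies, so treat this only as a way to sanity-check a target value before putting it in the HPA manifest.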