I spent a day deploying vLLM on GKE with TPU v5e. Here's the full guide - quota, capacity, Gemma 4 testing, and autoscaling

Reddit r/googlecloud·u/xprilion·about 1 month ago

#vllm #autoscaling #xprilion #gemma3 #article #discussion

I recently went through the process of setting up autoscaling LLM inference on GKE using Cloud TPU v5e and vLLM. The experience was educational enough that I wrote a detailed guide covering everything I encountered.

What the guide covers:

- How TPU quota actually works on GCP (there are three independent gates, and one of them is called GPUS_ALL_REGIONS - which blocks TPUs despite the name)
- Scanning zones for capacity and the right strategy when everything is exhausted
- The correct GKE syntax for TPU node pools (--machine-type, not --accelerator)
- Testing Gemma 4 (E2B, E4B, 26B-A4B) on vLLM's TPU backend - none work today due to a shared layers limitation
- Full HPA autoscaling setup using Managed Prometheus and vLLM's num_requests_waiting metric

What I deployed: Gemma 3 4B on a single TPU v5e chip with the complete autoscaling stack proven and working.…
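To make the autoscaling item concrete, here is a minimal sketch of the replica calculation the Kubernetes HPA documents: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0. It is not taken from the guide. The function name, the target of 5 waiting requests per replica, and the 1 to 4 replica bounds are all illustrative assumptions; only the num_requests_waiting metric comes from the post.

```python
import math


def desired_replicas(
    current_replicas: int,
    avg_waiting: float,
    target_waiting: float = 5.0,   # assumed target; not from the guide
    min_replicas: int = 1,         # assumed bound
    max_replicas: int = 4,         # assumed bound
    tolerance: float = 0.1,        # Kubernetes' documented default tolerance
) -> int:
    """Approximate the HPA decision for an average-value metric.

    avg_waiting stands in for the per-replica average of vLLM's
    num_requests_waiting metric as reported through Managed Prometheus.
    """
    if target_waiting <= 0:
        raise ValueError("target_waiting must be positive")

    ratio = avg_waiting / target_waiting

    # Within tolerance of the target, the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured minReplicas / maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # One replica with 12 queued requests against a target of 5.
    print(desired_replicas(1, 12))
```

With these assumed numbers, a queue that is a little over the target (say 5.2 against 5) does not trigger a scale-up, while an empty queue scales back down to the minimum. The real HPA also applies stabilization windows and scaling policies that this sketch leaves out.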
