vLLM Gemma4 26B Tuning on v6e-4


DEV Community·xbill·18 days ago


#gemmachallenge #gemma #devchallenge #software #model #tokens


✦ The successful benchmark run on TPU v6e-4 used the following "Balanced Production" flags. These were specifically tuned to stabilize the 26B MoE model on the 4-chip topology while maintaining peak performance.

🚀 vLLM Startup Command (Verified)

    vllm serve google/gemma-4-26B-A4B-it \
      --tensor-parallel-size 4 \
      --dtype bfloat16 \
      --kv-cache-dtype fp8 \
      --max-model-len 16384 \
      --speculative-config '{"method": "ngram", "num_speculative_tokens": 3}' \
      --max-num-batched-tokens 4096 \
      --max-num-seqs 256 \
      --enable-prefix-caching \
      --disable_chunked_mm_input \
      --limit-mm-per-prompt '{"image":4,"audio":1}' \
      --enable-auto-tool-choice \
      --tool-call-parser gemma4 \
      --reasoning-parser gemma4 \
      --trust-remote-code

⚙️ Critical Parameters Explained

│ Flag │ Value │ Rationale │
…
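The two JSON-valued flags (`--speculative-config` and `--limit-mm-per-prompt`) are the easiest part of the startup command to break when it is copied between shells. As a minimal sketch, assuming you launch the server from a Python wrapper, the flags can be kept in a dictionary and rendered into a shell-safe command line. The `FLAGS` layout and the `build_argv` helper are hypothetical names for illustration, not part of vLLM; the flag names and values are copied from the command above.

```python
import json
import shlex

# Flag names and values are copied from the startup command in the article.
# MODEL, FLAGS, and build_argv are illustrative names, not vLLM APIs.
MODEL = "google/gemma-4-26B-A4B-it"
FLAGS = {
    "--tensor-parallel-size": 4,
    "--dtype": "bfloat16",
    "--kv-cache-dtype": "fp8",
    "--max-model-len": 16384,
    "--speculative-config": {"method": "ngram", "num_speculative_tokens": 3},
    "--max-num-batched-tokens": 4096,
    "--max-num-seqs": 256,
    "--enable-prefix-caching": True,
    "--disable_chunked_mm_input": True,
    "--limit-mm-per-prompt": {"image": 4, "audio": 1},
    "--enable-auto-tool-choice": True,
    "--tool-call-parser": "gemma4",
    "--reasoning-parser": "gemma4",
    "--trust-remote-code": True,
}


def build_argv(model, flags):
    """Turn a flag dictionary into an argv list for `vllm serve`."""
    argv = ["vllm", "serve", model]
    for name, value in flags.items():
        if value is True:
            # Boolean switches take no value.
            argv.append(name)
        elif isinstance(value, dict):
            # JSON-valued flags are serialized as a single argument.
            argv += [name, json.dumps(value)]
        else:
            argv += [name, str(value)]
    return argv


if __name__ == "__main__":
    # shlex.join quotes the JSON arguments so the line can be pasted into a shell.
    print(shlex.join(build_argv(MODEL, FLAGS)))
```

Passing the list returned by `build_argv` directly to `subprocess.run` avoids shell quoting altogether; the printed form is only meant for logging or copy-pasting.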
