
KV FP8 with Gemma 4 26B

DEV Community·xbill·19 days ago

Gemma 4 Challenge: Write about Gemma 4 Submission

✦ The vLLM service is now online and healthy! 🟢

Final Status:
vLLM Health: 🟢 200 OK
Active Endpoint: http://34.95.135.58:8000
Model: google/gemma-4-26B-A4B-it
Optimizations: KV FP8 enabled, bfloat16, speculative decoding (ngram).

Key Observations

High Prefill Throughput: The TPU v6e cluster scaled efficiently under load. At maximum concurrency (1024 users) with a 16,384-token context, it reached a prefill rate of 475,552 tokens per second (tok/s).

TTFT Scaling: Time to first token (TTFT) rose with concurrency, as expected. A single user at 16k context saw about 1.1 seconds, while 1024 users at 16k context saw an average TTFT of about 19.2 seconds.

Max Context Limit Exception: The 32,768-token test failed at every concurrency level with an HTTP 400 error. This occurs because max_model_len is explicitly set to 32768 on the vLLM server, and the benchmark asks for 1 token of generation.…
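The context limit failure comes down to simple arithmetic: a 32,768-token prompt plus 1 generated token needs 32,769 positions, one more than a max_model_len of 32768 allows. The sketch below illustrates that budget check. The function names `fits_context` and `max_prompt_tokens` are hypothetical helpers made up for this example, not part of vLLM, and the check is only an approximation of whatever validation the server actually performs.

```python
# Illustrative sketch only: these helpers are hypothetical and are not vLLM APIs.
# Assumption: a request is accepted when prompt tokens plus requested output
# tokens fit within the server's max_model_len.

MAX_MODEL_LEN = 32768  # value reported for the vLLM server in the post


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 max_model_len: int = MAX_MODEL_LEN) -> bool:
    """Return True when the prompt and the requested output fit in the window."""
    return prompt_tokens + max_new_tokens <= max_model_len


def max_prompt_tokens(max_new_tokens: int,
                      max_model_len: int = MAX_MODEL_LEN) -> int:
    """Largest prompt length that still leaves room for the requested output."""
    return max_model_len - max_new_tokens


if __name__ == "__main__":
    # The benchmark asks for 1 generated token per request.
    for prompt_len in (16384, 32768):
        status = "fits" if fits_context(prompt_len, 1) else "exceeds limit"
        print(f"prompt={prompt_len} + 1 output token: {status}")
    print("largest usable prompt:", max_prompt_tokens(1))
```

Under these assumptions, the longest prompt the benchmark could send while still generating 1 token is 32,767 tokens; alternatively, the server's max_model_len would need to be raised above 32768.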
