Menu

Post image 1
Post image 2
1 / 2
0

tierKV: A Distributed KV Cache That Makes Evicted Blocks Faster to Restore Than GPU Cache Hits

DEV Community·prasanna kanagasabai·24 days ago
#Yw9JnaUL
#when#llm#rust#machinelearning#tierkv#vault
Reading 0:00
15s threshold

The Problem When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch — quadratic in sequence length. On a 30,000-token document that's 10+ seconds, every single time the same prompt reappears. tierKV intercepts evicted KV blocks, quantizes them, ships them to a vault on a LAN machine, and restores them on the next cache miss — injecting directly into vLLM's paged KV buffer with no attention recomputation. It integrates via vLLM's KVConnectorBase_V1 plugin API with no source changes required. Benchmarks (Qwen3.6-35B-A3B, Apple FY2025 10-K, 30,561 tokens) We ran the Apple FY2025 10-K filing through three scenarios. A full cold prefill with no cache took 10.75 seconds . A GPU cache hit (blocks already in VRAM) dropped that to 1.19 seconds . The cold vault restore came in at 0.52 seconds — 20× faster than a full prefill, and faster than the GPU cache hit.…

Continue reading — create a free account

Join HashtagPLUS to read full articles, follow hashtags, vote, and join the conversation.

Read More