The Problem

When your GPU's KV cache fills up, inference engines evict blocks and discard them. The next request with the same prefix re-runs full prefill from scratch, which is quadratic in sequence length. On a 30,000-token document that's 10+ seconds, every single time the same prompt reappears.

tierKV intercepts evicted KV blocks, quantizes them, ships them to a vault on a LAN machine, and restores them on the next cache miss, injecting them directly into vLLM's paged KV buffer with no attention recomputation. It integrates via vLLM's KVConnectorBase_V1 plugin API with no source changes required.

Benchmarks (Qwen3.6-35B-A3B, Apple FY2025 10-K, 30,561 tokens)

We ran the Apple FY2025 10-K filing through three scenarios. A full cold prefill with no cache took 10.75 seconds. A GPU cache hit (blocks already in VRAM) dropped that to 1.19 seconds. The cold vault restore came in at 0.52 seconds: about 20× faster than a full prefill, and faster than the GPU cache hit.…
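
To make the evict, quantize, store, restore cycle described under "The Problem" concrete, here is a minimal sketch in plain Python. It is not tierKV's code and does not touch vLLM's API. The names `Vault`, `prefix_key`, `quantize_int8`, and `dequantize_int8` are hypothetical. The symmetric int8 scheme is an assumption, since the text does not say which quantization is used, and an in-memory dict stands in for the vault on the LAN machine.

```python
import hashlib
import struct
from array import array


def prefix_key(token_ids):
    """Hash a token prefix into a lookup key (hypothetical keying scheme)."""
    packed = struct.pack(f"{len(token_ids)}I", *token_ids)
    return hashlib.sha256(packed).hexdigest()


def quantize_int8(values):
    """Symmetric int8 quantization of one block (assumed scheme)."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127.0 if peak > 0 else 1.0
    codes = array("b", (round(v / scale) for v in values))
    return scale, codes


def dequantize_int8(scale, codes):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


class Vault:
    """In-memory stand-in for the remote vault described in the text."""

    def __init__(self):
        self._blocks = {}

    def store_evicted(self, token_ids, kv_values):
        # Called when a block is evicted: quantize it and keep it by prefix.
        self._blocks[prefix_key(token_ids)] = quantize_int8(kv_values)

    def restore(self, token_ids):
        # Called on a cache miss: return the block, or None to force prefill.
        entry = self._blocks.get(prefix_key(token_ids))
        if entry is None:
            return None
        return dequantize_int8(*entry)


vault = Vault()
prefix = [101, 2023, 2003, 1037]   # toy token IDs
block = [0.5, -1.0, 0.25, 0.0]     # toy KV values for that prefix
vault.store_evicted(prefix, block)
restored = vault.restore(prefix)
```

A restore returns values that match the originals only up to quantization error, which is the trade-off behind a smaller vault footprint. A prefix that was never stored returns `None`, the case in which the engine would fall back to a full prefill.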