KVQuant: real terminal proof for KV-cache compression

KVQuant is a cache-compression layer for long-context inference. The interesting bit is not the idea (lots of projects have that) but whether it survives contact with a real model, a real terminal, and a real benchmark table. This write-up is the boring but useful version: what it does, what I ran, what the numbers were, and where it helps or doesn’t.

Why KV cache matters

When a model generates text, it keeps a memory of previous tokens in the KV cache. That cache grows with every step. Weight quantisation shrinks the model weights, but it doesn’t directly touch this memory tax.

KVQuant targets that cache directly:

- Allocate fewer bits for older tokens
- Pack the cache into smaller storage
- Restore it before the next forward pass

That gives you a real memory win on long-running chats and long-context inference.…
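To make the three steps above concrete, here is a minimal pure-Python sketch. It is not KVQuant's actual code: every function name, the min/max uniform quantisation scheme, and the 8-bit/4-bit age policy are assumptions chosen for illustration, and a real implementation would work on tensors rather than lists.

```python
from typing import List, Tuple

# Illustrative sketch only: all names and the bit-width policy are hypothetical.


def bits_for_age(age: int, recent_window: int = 4,
                 recent_bits: int = 8, old_bits: int = 4) -> int:
    """Assumed policy: tokens newer than `recent_window` keep more bits."""
    return recent_bits if age < recent_window else old_bits


def quantize(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Uniform min/max quantisation of one token's vector to `bits`-bit codes."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, lo, scale


def pack(codes: List[int], bits: int) -> bytes:
    """Pack `bits`-bit codes back to back into a byte string."""
    acc = 0
    for i, code in enumerate(codes):
        acc |= code << (i * bits)
    return acc.to_bytes((len(codes) * bits + 7) // 8, "little")


def unpack(data: bytes, bits: int, count: int) -> List[int]:
    """Inverse of `pack`: recover `count` codes of width `bits`."""
    acc = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(acc >> (i * bits)) & mask for i in range(count)]


def dequantize(codes: List[int], lo: float, scale: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [lo + code * scale for code in codes]


def compress_cache(cache: List[List[float]]):
    """Steps 1 and 2: choose bits by token age, then quantise and pack."""
    newest = len(cache) - 1
    packed = []
    for t, vec in enumerate(cache):
        bits = bits_for_age(newest - t)
        codes, lo, scale = quantize(vec, bits)
        packed.append((pack(codes, bits), bits, len(vec), lo, scale))
    return packed


def restore_cache(packed) -> List[List[float]]:
    """Step 3: unpack and dequantise before the next forward pass."""
    return [dequantize(unpack(data, bits, n), lo, scale)
            for data, bits, n, lo, scale in packed]


if __name__ == "__main__":
    # Toy cache: 6 tokens, 8 values each (stand-ins for key/value entries).
    cache = [[0.1 * (t + 1) * (d - 3.5) for d in range(8)] for t in range(6)]
    packed = compress_cache(cache)
    restored = restore_cache(packed)
    payload = sum(len(entry[0]) for entry in packed)
    print("packed payload bytes:", payload)
    print("max abs error:", max(abs(a - b)
                                for row, out in zip(cache, restored)
                                for a, b in zip(row, out)))
```

The payload count above leaves out the per-token `lo` and `scale` metadata, which a real layout would also have to store. The reconstruction error for each value is bounded by half a quantisation step, so older tokens stored at fewer bits are restored less precisely than recent ones.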