Extreme KV‑Cache Compression and Long‑Context Efficiency

Static quantization is giving way to rotation‑based and context‑sensitive schemes. OCTOPUS and OScaR reach near‑lossless INT2 performance while cutting cache size dramatically [1], [2]. Sparse token indexers replace dense caches with a searchable sketch, preserving attention fidelity at lower memory cost [3]. Linear‑attention decoupling splits the KV stream into a short‑term mutable part and a long‑term static part, keeping long‑context reasoning accurate without quadratic growth [4]. Together these ideas let models handle thousands of tokens on modest hardware, easing a bottleneck for many retrieval‑augmented and multilingual applications.

Verifiable Rewards for LLM Reasoning

Recent work on RL from verifiable rewards (RLVR) refines policy updates with token‑level credit signals instead of the coarse, sequence‑level credit of the GRPO baseline, which gives every token in a response the same advantage. Discriminative token weighting assigns higher reward to correct intermediate steps, improving math and code accuracy [5]. …
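
As a rough illustration of the gap between sequence‑level and token‑level credit, the sketch below first standardizes rewards within a sampled group (the GRPO‑style baseline) and then redistributes one response's advantage across its tokens with non‑uniform weights. It is not the method of [5]: the function names, the `boost` parameter, the per‑token `step_correct` flags, and the rule used for negative advantages are all assumptions made for illustration.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style baseline: standardize scalar rewards within one sampled group.

    Every token of response i would receive the same value advantages[i].
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def token_advantages(
    seq_adv: float,
    step_correct: List[bool],
    boost: float = 2.0,
) -> List[float]:
    """Spread one sequence-level advantage over tokens with non-uniform weights.

    Assumption (illustrative only): `step_correct[t]` says whether token t
    belongs to a verified-correct intermediate step. For a non-negative
    advantage, correct tokens get weight `boost`; for a negative advantage,
    incorrect tokens get it instead, so correct steps are penalized less.
    Weights are normalized to mean 1, so the summed credit over the response
    matches the uniform (sequence-level) case.
    """
    if not step_correct:
        return []
    emphasize_correct = seq_adv >= 0
    raw = [boost if ok == emphasize_correct else 1.0 for ok in step_correct]
    mean_w = sum(raw) / len(raw)
    return [seq_adv * w / mean_w for w in raw]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored by a verifier (1 = correct).
    advs = group_advantages([1.0, 0.0, 1.0, 0.0])
    # Hypothetical per-token step labels for the first response.
    flags = [True, False, True, False]
    print(token_advantages(advs[0], flags, boost=3.0))
```

Normalizing the weights to mean 1 is a design choice in this sketch: it keeps the total update magnitude per response comparable to the uniform baseline, so only the distribution of credit across tokens changes.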