I wanted to answer one question: if I remove eager overhead, can a TurboQuant-style compressed primitive beat dense GPU logits?

The short answer: it beat eager TurboQuant, but it did not beat dense FP16 logits by enough.

TL;DR

- Exact weighted value decode was mathematically clean, but only improved value_decode_sec by about 2.9%.
- A fused packed-codebook logits kernel removed most eager overhead and beat eager TurboQuant main logits by 7x to 18x.
- It still missed the dense FP16 logits gate: the best K0.2 result was 1.56x at 8192 and 1.99x at 16384, where the gate required >=2x.
- The lesson was strict: beating your own eager prototype is not enough. The compressed path must beat the real dense baseline with room left for softmax, values, residuals, and updates.…
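
To make the gate concrete, here is a minimal sketch of the pass/fail check. It assumes the reported 1.56x and 1.99x figures are speedups relative to dense FP16 logits and that 8192 and 16384 are problem sizes. The names `speedup`, `passes_gate`, and `REQUIRED_SPEEDUP` are hypothetical and are not taken from the benchmark code.

```python
# Hypothetical helpers for illustration only; not the original benchmark harness.

REQUIRED_SPEEDUP = 2.0  # the ">=2x" gate stated in the post


def speedup(dense_sec: float, compressed_sec: float) -> float:
    """Ratio of dense baseline time to compressed-path time (higher is better)."""
    if dense_sec <= 0 or compressed_sec <= 0:
        raise ValueError("timings must be positive")
    return dense_sec / compressed_sec


def passes_gate(ratio: float, required: float = REQUIRED_SPEEDUP) -> bool:
    """True only if the compressed path clears the required speedup."""
    return ratio >= required


# Best K0.2 ratios reported in the post, keyed by size (assumed interpretation).
reported = {8192: 1.56, 16384: 1.99}
gate_results = {size: passes_gate(ratio) for size, ratio in reported.items()}
```

Under these assumptions both sizes fail the gate, including the 1.99x result at 16384, because the threshold is a hard `>=` comparison rather than a rounded one.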