Beating Eager TurboQuant Was Not Enough: Why Dense GPU Attention Still Won


DEV Community·Alankrit Verma·about 1 month ago


#machinelearning #gpu #logits #dense #fullscreen #eager


I wanted to answer one question: if I remove eager overhead, can a TurboQuant-style compressed primitive beat dense GPU logits?

The short answer: it beat eager TurboQuant, but it did not beat dense FP16 logits by enough.

TL;DR

- Exact weighted value decode was mathematically clean, but only improved value_decode_sec by about 2.9%.
- A fused packed-codebook logits kernel removed most eager overhead and beat eager TurboQuant main logits by 7x to 18x.
- It still missed the dense FP16 logits gate: the best K0.2 result was 1.56x at 8192 and 1.99x at 16384, where the gate required >=2x.
- The lesson was strict: beating your own eager prototype is not enough. The compressed path must beat the real dense baseline with room left for softmax, values, residuals, and updates.…
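The gate arithmetic in the TL;DR can be checked in a few lines. This is a minimal sketch, not the author's benchmark harness: the names `GATE`, `reported`, and `passes_gate` are hypothetical, and the excerpt does not say what 8192 and 16384 measure, so they are treated here only as opaque size labels. The numbers are the ones quoted above.

```python
# Minimal sketch of the ">= 2x" gate check, using only figures quoted in the TL;DR.
# All names here are illustrative assumptions, not taken from the author's code.

GATE = 2.0  # the gate required >= 2x

# Best K0.2 result per size label (what the sizes measure is not stated in the excerpt).
reported = {8192: 1.56, 16384: 1.99}


def passes_gate(ratio: float, gate: float = GATE) -> bool:
    """Return True only if the measured ratio meets or exceeds the gate."""
    return ratio >= gate


results = {size: passes_gate(ratio) for size, ratio in reported.items()}
shortfall = {size: round(GATE - ratio, 2) for size, ratio in reported.items()}

for size in reported:
    status = "pass" if results[size] else "miss"
    print(f"size={size}: {reported[size]:.2f}x vs gate {GATE:.1f}x -> {status} "
          f"(short by {shortfall[size]:.2f}x)")
```

Both sizes miss the gate, the larger one by only 0.01x. A threshold check like this is deliberately unforgiving: a near miss at 1.99x still counts as a miss, which matches the strictness the TL;DR describes.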
