Beating Eager TurboQuant Was Not Enough: Why Dense GPU Attention Still Won

DEV Community·Alankrit Verma·about 1 month ago

I wanted to answer one question: if I remove eager overhead, can a TurboQuant-style compressed primitive beat dense GPU logits? The short answer: it beat eager TurboQuant, but it did not beat dense FP16 logits by enough.

TL;DR

- Exact weighted value decode was mathematically clean, but only improved value_decode_sec by about 2.9%.
- A fused packed-codebook logits kernel removed most eager overhead and beat eager TurboQuant main logits by 7x to 18x.
- It still missed the dense FP16 logits gate: the best K0.2 result was 1.56x at 8192 and 1.99x at 16384, where the gate required >=2x.
- The lesson was strict: beating your own eager prototype is not enough. The compressed path must beat the real dense baseline with room left for softmax, values, residuals, and updates.…
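To make the gate arithmetic concrete, here is a minimal sketch of the pass/fail check in Python. It assumes the reported 1.56x and 1.99x figures are speedups of the compressed logits path over dense FP16 logits at context lengths 8192 and 16384. The names `REPORTED_SPEEDUP`, `speedup`, and `passes_gate` are hypothetical and only illustrate the comparison; they are not taken from the author's benchmark code.

```python
# Speedups for the K0.2 configuration as quoted in the post, keyed by
# context length. Treating them as "dense time / compressed time" is an
# assumption made for this illustration.
REPORTED_SPEEDUP = {8192: 1.56, 16384: 1.99}

# Minimum speedup over dense FP16 logits required by the post's gate.
GATE = 2.0


def speedup(dense_sec: float, compressed_sec: float) -> float:
    """Return how many times faster the compressed path is than dense."""
    if compressed_sec <= 0:
        raise ValueError("compressed_sec must be positive")
    return dense_sec / compressed_sec


def passes_gate(value: float, gate: float = GATE) -> bool:
    """A result passes only if it meets or exceeds the gate."""
    return value >= gate


# Check each reported result against the gate.
results = {ctx: passes_gate(s) for ctx, s in REPORTED_SPEEDUP.items()}

for ctx, ok in results.items():
    status = "pass" if ok else "miss"
    print(f"context {ctx}: {REPORTED_SPEEDUP[ctx]:.2f}x -> {status}")
```

Under this reading both context lengths miss the gate, including 16384, where 1.99x falls just short of 2x. Measuring against the dense baseline rather than the eager prototype is what makes the 7x to 18x figure insufficient on its own.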
