TurboQuant on a MacBook Pro, part 2: perplexity, KL divergence, and asymmetric K/V on M5 Max

DEV Community·Christopher Maher·about 1 month ago

#ai #llm #kubernetes #opensource #q8_0 #turbo4

Originally published at llmkube.com/blog/turboquant-m5-max-quality-and-asymmetric. Cross-posted here for the dev.to audience.

Yesterday's M5 Max KV cache post drew a clean set of asks in the comments: where are the perplexity numbers, what about KL divergence, did you try asymmetric K/V combos, and can you fill the 32K to 128K gap with a 64K row? I ran them overnight on the same hardware. Numbers below.

TL;DR

- q8_0 KV cache is essentially free at 4k context. PPL delta vs f16 is −0.0005 (well inside the ±0.036 stderr). KL is 0.0016. Top-1 token agreement is 98.64%.
- turbo3 and turbo4 cost a real but small amount of quality. turbo3: ~1% PPL increase, 5pp top-token disagreement, KL roughly 12× q8_0. turbo4 sits between q8_0 and turbo3, in line with its lower compression ratio.
- -ctk q8_0 -ctv turbo4 is the new winner for long context. It matches symmetric q8_0 throughput at every depth tested and fits 512K, where symmetric q8_0 OOM'd: q8_0-grade prefill, turbo4-grade memory ceiling.
- -ctk f16 -ctv turbo4 is broken on this fork on Metal.…
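For readers who want to see what the three quality metrics above measure, here is a minimal sketch in plain Python. It is not the tooling used for the numbers in this post. It assumes you already have per-position next-token logits from an f16 baseline run and from a quantized-cache run over the same text; the function name `compare_runs` and its argument names are hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one position's vocabulary logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def argmax(values):
    # Index of the largest value (first one wins on ties).
    return max(range(len(values)), key=values.__getitem__)

def compare_runs(base_logits, test_logits, targets):
    """Compare a baseline run against a test run (hypothetical helper).

    base_logits, test_logits: one list of vocabulary logits per position.
    targets: the true next-token id at each position.
    """
    n = len(targets)
    nll_base = nll_test = kl_sum = 0.0
    agree = 0
    for b, t, y in zip(base_logits, test_logits, targets):
        lp_b, lp_t = log_softmax(b), log_softmax(t)
        nll_base -= lp_b[y]  # negative log-likelihood of the true token
        nll_test -= lp_t[y]
        # KL(base || test): how far the test distribution drifts from baseline.
        kl_sum += sum(math.exp(pb) * (pb - pt) for pb, pt in zip(lp_b, lp_t))
        agree += argmax(b) == argmax(t)
    return {
        "ppl_base": math.exp(nll_base / n),  # perplexity = exp(mean NLL)
        "ppl_test": math.exp(nll_test / n),
        "mean_kl": kl_sum / n,
        "top1_agreement": agree / n,
    }

# Toy usage with a 3-token vocabulary and two positions (made-up logits).
base = [[1.0, 2.0, 0.5], [0.1, 0.2, 3.0]]
test = [[1.1, 1.9, 0.5], [0.1, 0.3, 2.9]]
print(compare_runs(base, test, targets=[1, 2]))
```

The PPL delta quoted in the TL;DR corresponds to `ppl_test - ppl_base`. Mean KL and top-1 agreement compare the two output distributions directly, so they can pick up drift even when perplexity barely moves.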
