DeepSeek V4's Real Innovation Isn't Scale—It's Memory Architecture


DEV Community·Aamer Mihaysi·about 1 month ago

#ai #agents #deepseek #llm #context #attention


The announcement of DeepSeek V4 landed with predictable fanfare about parameter counts and benchmark scores: 1.6T parameters, a 1M-token context window, and results competitive with GPT-5.4 and Opus 4.7. But the headline numbers obscure something more significant: this is the first open-weight model that makes million-token context actually usable for agents. Not theoretically. Actually.

The difference lies in KV cache compression. At 1M tokens, DeepSeek V4 requires 9.62 GiB of memory per sequence in BF16, compared with 83.9 GiB for DeepSeek V3.2, a nearly 9x reduction. The savings come from what DeepSeek calls Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA): alternating layers that apply different compression ratios (4x for nearby context, 128x for distant tokens), with shared key-value vectors and top-k sparse attention over compressed representations.

This matters because long context has been the domain of demos, not production. Everyone claims to support 1M tokens. Almost nobody can afford to use them.…
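To make the affordability point concrete, here is a minimal back-of-the-envelope sketch using only the two per-sequence figures quoted above. The 80 GiB cache budget and the function names are assumptions for illustration, not anything DeepSeek specifies, and the calculation ignores model weights and activations.

```python
# Per-sequence memory at 1M tokens in BF16, as quoted in the article.
V4_GIB_PER_SEQUENCE = 9.62
V32_GIB_PER_SEQUENCE = 83.9

# Hypothetical budget: memory left over for the cache on one accelerator.
# This number is an assumption chosen purely for illustration.
CACHE_BUDGET_GIB = 80.0


def reduction_ratio(old_gib: float, new_gib: float) -> float:
    """How many times smaller the new per-sequence footprint is."""
    return old_gib / new_gib


def sequences_that_fit(budget_gib: float, per_sequence_gib: float) -> int:
    """Whole 1M-token sequences that fit inside the cache budget."""
    return int(budget_gib // per_sequence_gib)


if __name__ == "__main__":
    ratio = reduction_ratio(V32_GIB_PER_SEQUENCE, V4_GIB_PER_SEQUENCE)
    print(f"reduction: {ratio:.2f}x")
    print("V4 sequences:", sequences_that_fit(CACHE_BUDGET_GIB, V4_GIB_PER_SEQUENCE))
    print("V3.2 sequences:", sequences_that_fit(CACHE_BUDGET_GIB, V32_GIB_PER_SEQUENCE))
```

Under that assumed budget, several full-length V4 sequences fit side by side, while a single V3.2 sequence at the quoted figure does not fit at all. That is the gap between supporting 1M tokens and being able to afford them.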
