DeepSeek V4 landed on April 24, 2026, and the headline numbers are striking: 80.6% on SWE-bench Verified — matching Claude Opus 4.6 and Gemini 3.1 Pro — with a 1-million-token context window. But the real story isn't the benchmark. It's that at 1 million tokens of context, V4-Pro uses only 27% of the inference compute and 10% of the KV cache that its predecessor V3.2 required. That's not a tweak. That's a rethink of how attention works. This post explains the specific mechanisms that make this possible — and why it matters for anyone running local AI on consumer hardware.
Every transformer model maintains a Key-Value (KV) cache during inference. When the model generates a response, it needs to attend to every previous token in the context, and the KV cache stores each token's key and value vectors so they don't have to be recomputed at every generation step. The problem is fundamental: the cache grows linearly with sequence length. The longer the context, the more memory you need.
For a standard multi-head attention layer, the KV cache memory per token is determined by a straightforward formula:
```
KV cache memory per token =
      2                   # key + value
    × head_dimension      # typically 128
    × bytes_per_value     # FP16 = 2 bytes
    × num_heads           # typically 128
    × num_layers          # typically 96-128

# Example: typical large model (H=128 heads, L=96 layers, FP16):
= 2 × 128 × 2 × 128 × 96
= ~6.3 MB per token

# At 100K token context:
6.3 MB × 100,000 = ~630 GB   ← doesn't fit anywhere locally

# At 32K context (more realistic):
6.3 MB × 32,000 = ~200 GB    ← still impossible on consumer hardware
```

This is why long context is expensive. Even at 32K tokens — which sounds reasonable — the KV cache alone exceeds the VRAM of any consumer GPU. The model weights haven't even entered the calculation yet.
Before DeepSeek V4, the standard approach to reducing KV cache was to reduce the number of heads on the key and value matrices. Multi-head attention (MHA) uses a separate K and V for every attention head. Grouped Query Attention (GQA) shares one K/V pair across a group of query heads. Multi-Query Attention (MQA) goes all the way — a single K/V pair shared across every query head.
In the example configuration above (128 heads, 96 layers), MQA cuts the KV cache from ~6.3 MB to ~48 KB per token, a 128× reduction. That's genuinely useful. But there's a cost: sharing a single KV pair across all 128 query heads means all heads see the same context. The model loses its ability to track different types of relationships in parallel. Benchmark quality drops, especially on tasks that require attending to multiple different patterns at once.
GQA (used by Llama 3.1, Mistral, Gemma) is the practical middle ground: it reduces heads to 8 groups, cutting KV cache by 16× while keeping most of MHA's quality. Most open-source models you run locally use GQA precisely because it's the best head-reduction tradeoff. DeepSeek V4 takes a completely different approach — instead of reducing across heads, it compresses across tokens.
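Before moving on, a minimal back-of-envelope sketch in Python makes the head-count tradeoff concrete. It uses the same illustrative configuration as the formula above (128-dim heads, 96 layers, FP16 cache); the function and its defaults are assumptions for illustration, not the configuration of any specific model.

```python
def kv_cache_bytes_per_token(num_kv_heads: int,
                             head_dim: int = 128,
                             num_layers: int = 96,
                             bytes_per_value: int = 2) -> int:
    """Per-token KV cache: 2 (K and V) × head_dim × bytes × KV heads × layers."""
    return 2 * head_dim * bytes_per_value * num_kv_heads * num_layers

for name, kv_heads in [("MHA (128 KV heads)", 128),
                       ("GQA  (8 KV heads)", 8),
                       ("MQA  (1 KV head)", 1)]:
    per_token = kv_cache_bytes_per_token(kv_heads)
    print(f"{name:20s} {per_token / 2**20:5.2f} MiB/token -> "
          f"{per_token * 32_000 / 2**30:6.1f} GiB at 32K context")

# MHA (128 KV heads)   6.00 MiB/token ->  187.5 GiB at 32K context
# GQA  (8 KV heads)    0.38 MiB/token ->   11.7 GiB at 32K context
# MQA  (1 KV head)     0.05 MiB/token ->    1.5 GiB at 32K context
```

(The output uses binary GiB, which is why the 32K figure reads ~187 GiB rather than the ~200 GB decimal figure quoted above.)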
All of the head-reduction approaches — MQA, GQA — are variations on the same theme: compress the K/V matrices by sharing them across attention heads. DeepSeek V4 shifts the axis of compression entirely. Instead of asking "how many heads do we need?", it asks "how many tokens do we actually need to attend to?"
The key insight is that in a 1-million-token context, most tokens are not equally important to any given query. A token asking about a bug in line 47,000 of a codebase probably does not need to attend with full resolution to line 3,000. It needs global context — a compressed understanding of what came before — plus precise local context near line 47,000. DeepSeek V4 exploits this asymmetry.
Before storing tokens into the KV cache, V4 runs a token-level compressor that groups consecutive tokens and merges them into a single compressed KV entry. Crucially, this is not simple averaging. The compression uses learned weights — a softmax-gated pooling with a learned positional bias — that decide how much each token contributes to the compressed entry. The weighting can even vary per dimension, giving fine-grained control over which information to keep.
To avoid sharp information loss at group boundaries, the compressor uses overlapping windows — each compressed entry draws from a range that overlaps with its neighbors. Information near a boundary is preserved in both adjacent entries, preventing the model from ever seeing a "hard cut" in the context stream.
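The post doesn't spell out the exact compressor math, but a minimal NumPy sketch of the general idea looks roughly like the following: per-dimension softmax gating over overlapping windows, with a learned positional bias. The gate weights, positional bias, window size, and stride here are hypothetical placeholders chosen so the example runs; in the real model they would be learned end-to-end.

```python
import numpy as np

def compress_kv(kv, gate_w, pos_bias, window=4, stride=3):
    """Merge overlapping windows of `window` tokens into single compressed entries.

    kv:       (seq_len, d) key or value vectors for one head
    gate_w:   (d,) learned per-dimension gate weights (assumed shape)
    pos_bias: (window,) learned positional bias (assumed shape)
    Returns:  (num_windows, d) compressed entries
    """
    entries = []
    for start in range(0, kv.shape[0] - window + 1, stride):
        chunk = kv[start:start + window]                 # (window, d)
        # Gating score per token and per dimension: content term + positional bias.
        scores = chunk * gate_w + pos_bias[:, None]      # (window, d)
        weights = np.exp(scores - scores.max(axis=0))    # softmax over the window...
        weights /= weights.sum(axis=0)                   # ...independently for each dimension
        entries.append((weights * chunk).sum(axis=0))    # weighted merge -> one compressed entry
    return np.stack(entries)

# 16 tokens, 8-dim vectors; stride 3 < window 4, so adjacent windows share one token
# and information near a boundary lands in both neighbouring entries.
rng = np.random.default_rng(0)
kv = rng.normal(size=(16, 8))
compressed = compress_kv(kv, gate_w=rng.normal(size=8), pos_bias=rng.normal(size=4))
print(compressed.shape)   # (5, 8)
```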
DeepSeek V4 uses the token compressor at two different compression ratios, creating two distinct attention modes that serve different purposes in the model:
CSA compresses every 4 tokens into one KV entry — a 4× reduction in sequence length. Then it adds sparsity on top: instead of attending to all compressed entries, each query token uses a "lightning indexer" (a fast FP4-precision dot product with ReLU scoring) to select only the top 1,024 most relevant compressed KV entries to attend to. You get detailed, high-fidelity context — but only for the parts of the sequence that actually matter to the current query.
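As a sketch of the selection step, the snippet below scores every compressed entry with a cheap ReLU'd dot product, keeps the top 1,024, and runs ordinary softmax attention over just that subset. The real indexer operates in FP4; this toy version stays in standard floats, and the names, shapes, and sizes are illustrative assumptions rather than DeepSeek's implementation.

```python
import numpy as np

def sparse_attend(q, keys, values, top_k=1024):
    """q: (d,) query; keys/values: (n, d) compressed entries; returns a (d,) context vector."""
    # 1. Lightning-indexer-style scoring: one dot product per entry, then ReLU.
    scores = np.maximum(keys @ q, 0.0)                 # (n,)
    # 2. Keep only the top_k most relevant compressed entries.
    idx = np.argpartition(scores, -top_k)[-top_k:]
    # 3. Ordinary scaled softmax attention over just that subset.
    logits = (keys[idx] @ q) / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[idx]

# A 1M-token context compressed 4x would give ~250K candidate entries; 100K is used
# here just to keep the demo light. Attention still touches only 1,024 of them.
rng = np.random.default_rng(0)
n, d = 100_000, 128
out = sparse_attend(rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(out.shape)   # (128,)
```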
HCA compresses every 128 tokens into one KV entry — a 128× reduction. Then it applies dense attention (no sparse selection) over all the compressed entries. The resulting context window is tiny in terms of entries, but spans the entire sequence. HCA gives the model a high-level map of everything that's been said, at very low compute cost.
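Some quick arithmetic (illustrative, assuming the ratios above apply uniformly across the sequence) shows how small the per-query workload becomes at a 1-million-token context:

```python
context = 1_000_000

csa_entries  = context // 4      # 250,000 compressed entries available to CSA...
csa_attended = 1_024             # ...but each query attends to only the top 1,024 of them

hca_entries  = context // 128    # 7,812 compressed entries, attended densely by every query

print(csa_entries, csa_attended, hca_entries)   # 250000 1024 7812
```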
The real elegance is how V4 mixes these two modes across the transformer's layers. Different layers in a transformer play different roles: early layers build basic representations, middle layers refine and relate concepts, final layers produce precise outputs. V4's hybrid architecture matches the attention type to each layer's job, leaning on the cheap, global HCA view where a rough map of the whole context is enough and reserving the more expensive, more precise CSA view for the layers that need fine-grained detail.
This gradient from cheap-and-global to expensive-and-precise mirrors how a human expert reads a long document: skim for structure first, then re-read the sections that matter for the specific question, then formulate a careful answer with full attention on the relevant details.
Across the full 1-million-token context, the CSA+HCA hybrid delivers substantial efficiency gains versus DeepSeek V3.2, which already used the previous state-of-the-art Multi-head Latent Attention (MLA): at 1M tokens, V4-Pro needs roughly 27% of V3.2's inference compute (a ~73% reduction) and roughly 10% of its KV cache (a ~90% reduction).
The 90% KV cache reduction is relative to V3.2, which already used MLA (Multi-head Latent Attention) — itself a major improvement over standard MHA. V4's savings stack on top of an already highly optimized baseline. Compared to a naive standard MHA implementation, the real-world reduction is closer to 99.9% at 1M context.
The practical implication for local AI users is direct: more efficient attention means less VRAM consumed by the context window, which means more room for larger models, longer contexts, or both. Every advance in KV cache efficiency shifts the model-runs-locally threshold downward — models and context lengths that were previously impossible on consumer hardware become feasible.
The V4-Pro and V4-Flash flagship models themselves still require serious multi-GPU hardware to run locally (V4-Flash needs at minimum ~77GB at IQ2 quantization — roughly a Mac Studio M4 Ultra). But the architectural principles of CSA and HCA are not exclusive to DeepSeek. They will propagate into smaller, distilled, and fine-tuned models over the coming months — just as MLA from DeepSeek V2 influenced downstream architectures across the open-source ecosystem.
When distilled V4 variants at 7B and 14B land — likely by June 2026 — they will carry V4's attention efficiency at consumer-hardware sizes. A 14B model with CSA-style attention running at 32K or 64K context on an RTX 4070 will be qualitatively different from the current generation of 14B models hitting VRAM limits at 8K context.
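To put a rough number on that, here is an entirely hypothetical back-of-envelope estimate for a 14B-class model at 64K context. It assumes a Llama-style GQA layout (40 layers, 8 KV heads, 128-dim heads, FP16 cache) and assumes a 4× CSA-style token compressor shrinks the cache by the same factor; none of these figures describe a real released model.

```python
def kv_cache_gib(tokens, layers=40, kv_heads=8, head_dim=128, bytes_per_value=2, compression=1):
    """Rough KV cache size in GiB for an assumed 14B-class GQA configuration."""
    per_token = 2 * head_dim * bytes_per_value * kv_heads * layers   # bytes per uncompressed token
    return per_token * (tokens / compression) / 2**30

print(f"64K context, no compression: {kv_cache_gib(65_536):.1f} GiB")                 # 10.0 GiB
print(f"64K context, 4x compression: {kv_cache_gib(65_536, compression=4):.1f} GiB")  # 2.5 GiB
```

On a 12 GB card, the uncompressed figure overflows VRAM before the model weights are even loaded; the compressed figure is what would make 64K context plausible alongside quantized 14B weights.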
Understanding the architecture is one thing. Knowing whether a specific model — at a specific quantization and context length — will actually fit in your GPU is another. That's exactly what Runyard's VRAM Calculator is built for. Enter your GPU and target context length, pick a model from the catalog, and get an instant answer on whether it fits and at what quantization. When the V4 distills appear, you'll know immediately which variant to pull — before committing to a 10–20GB download.
Context length dramatically changes VRAM requirements. A 7B model at Q4_K_M uses ~4.7GB for weights — but at 32K context, the KV cache adds another 1–2GB. At 128K context, it adds 6–8GB. Models with efficient attention (like V4 distills) compress that KV cache, so they hit longer contexts on the same VRAM. The VRAM Calculator at www.runyard.dev/tools/vram-calculator accounts for both weights and context overhead.