DeepSeek V4 landed on April 24, 2026, and the headline numbers are striking: 80.6% on SWE-bench Verified — matching Claude Opus 4.6 and Gemini 3.1 Pro — with a 1-million-token context window. But the real story isn't the benchmark. It's that at 1 million tokens of context, V4-Pro uses only 27% of the inference compute and 10% of the KV cache that its predecessor V3.2 required. That's not a tweak. That's a rethink of how attention works. This post explains the specific mechanisms that make this possible — and why it matters for anyone running local AI on consumer hardware.
Every transformer model maintains a Key-Value (KV) cache during inference. When the model generates a response, it needs to attend to every previous token in the context, and the KV cache stores each token's key and value vectors so they don't have to be recomputed at every generation step. The problem is fundamental: the cache grows linearly with sequence length. The longer the context, the more memory you need.
For a standard multi-head attention layer, the KV cache memory per token is determined by a straightforward formula:
```
KV cache memory per token =
      2                   # key + value
    × head_dimension      # typically 128
    × bytes_per_value     # FP16 = 2 bytes
    × num_heads           # typically 128
    × num_layers          # typically 96-128

# Example: typical large model (H=128 heads, L=96 layers, FP16):
= 2 × 128 × 2 × 128 × 96
= ~6.3 MB per token

# At 100K token context:
6.3 MB × 100,000 = ~630 GB   ← doesn't fit anywhere locally

# At 32K context (more realistic):
6.3 MB × 32,000 = ~200 GB    ← still impossible on consumer hardware
```

This is why long context is expensive. Even at 32K tokens — which sounds reasonable — the KV cache alone exceeds the VRAM of any consumer GPU. The model weights haven't even entered the calculation yet.
Before DeepSeek V4, the standard approach to reducing KV cache was to reduce the number of heads on the key and value matrices. Multi-head attention (MHA) uses a separate K and V for every attention head. Grouped Query Attention (GQA) shares one K/V pair across a group of query heads. Multi-Query Attention (MQA) goes all the way — a single K/V pair shared across every query head.
In the example configuration above (128 heads, 96 layers), MQA cuts the KV cache from ~6.3 MB to ~48 KB per token, a 128× reduction. That's genuinely useful. But there's a cost: sharing a single KV pair across all 128 query heads means all heads see the same context. The model loses its ability to track different types of relationships in parallel. Benchmark quality drops, especially on tasks that require attending to multiple different patterns at once.
GQA (used by Llama 3.1, Mistral, Gemma) is the practical middle ground: it reduces heads to 8 groups, cutting KV cache by 16× while keeping most of MHA's quality. Most open-source models you run locally use GQA precisely because it's the best head-reduction tradeoff. DeepSeek V4 takes a completely different approach — instead of reducing across heads, it compresses across tokens.
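Before moving on, a minimal back-of-envelope sketch in Python makes the head-count tradeoff concrete. It uses the same illustrative configuration as the formula above (128-dim heads, 96 layers, FP16 cache); the function and its defaults are assumptions for illustration, not the configuration of any specific model.

```python
def kv_cache_bytes_per_token(num_kv_heads: int,
                             head_dim: int = 128,
                             num_layers: int = 96,
                             bytes_per_value: int = 2) -> int:
    """Per-token KV cache: 2 (K and V) × head_dim × bytes × KV heads × layers."""
    return 2 * head_dim * bytes_per_value * num_kv_heads * num_layers

for name, kv_heads in [("MHA (128 KV heads)", 128),
                       ("GQA  (8 KV heads)", 8),
                       ("MQA  (1 KV head)", 1)]:
    per_token = kv_cache_bytes_per_token(kv_heads)
    print(f"{name:20s} {per_token / 2**20:5.2f} MiB/token -> "
          f"{per_token * 32_000 / 2**30:6.1f} GiB at 32K context")

# MHA (128 KV heads)   6.00 MiB/token ->  187.5 GiB at 32K context
# GQA  (8 KV heads)    0.38 MiB/token ->   11.7 GiB at 32K context
# MQA  (1 KV head)     0.05 MiB/token ->    1.5 GiB at 32K context
```

(The output uses binary GiB, which is why the 32K figure reads ~187 GiB rather than the ~200 GB decimal figure quoted above.)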
All of the head-reduction approaches — MQA, GQA — are variations on the same theme: compress the K/V matrices by sharing them across attention heads. DeepSeek V4 shifts the axis of compression entirely. Instead of asking "how many heads do we need?", it asks "how many tokens do we actually need to attend to?"
The key insight is that in a 1-million-token context, most tokens are not equally important to any given query. A token asking about a bug in line 47,000 of a codebase probably does not need to attend with full resolution to line 3,000. It needs global context — a compressed understanding of what came before — plus precise local context near line 47,000. DeepSeek V4 exploits this asymmetry.
Before storing tokens into the KV cache, V4 runs a token-level compressor that groups consecutive tokens and merges them into a single compressed KV entry. Crucially, this is not simple averaging. The compression uses learned weights — a softmax-gated pooling with a learned positional bias — that decide how much each token contributes to the compressed entry. The weighting can even vary per dimension, giving fine-grained control over which information to keep.
To avoid sharp information loss at group boundaries, the compressor uses overlapping windows — each compressed entry draws from a range that overlaps with its neighbors. Information near a boundary is preserved in both adjacent entries, preventing the model from ever seeing a "hard cut" in the context stream.
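The post doesn't spell out the exact compressor math, but a minimal NumPy sketch of the general idea looks roughly like the following: per-dimension softmax gating over overlapping windows, with a learned positional bias. The gate weights, positional bias, window size, and stride here are hypothetical placeholders chosen so the example runs; in the real model they would be learned end-to-end.

```python
import numpy as np

def compress_kv(kv, gate_w, pos_bias, window=4, stride=3):
    """Merge overlapping windows of `window` tokens into single compressed entries.

    kv:       (seq_len, d) key or value vectors for one head
    gate_w:   (d,) learned per-dimension gate weights (assumed shape)
    pos_bias: (window,) learned positional bias (assumed shape)
    Returns:  (num_windows, d) compressed entries
    """
    entries = []
    for start in range(0, kv.shape[0] - window + 1, stride):
        chunk = kv[start:start + window]                 # (window, d)
        # Gating score per token and per dimension: content term + positional bias.
        scores = chunk * gate_w + pos_bias[:, None]      # (window, d)
        weights = np.exp(scores - scores.max(axis=0))    # softmax over the window...
        weights /= weights.sum(axis=0)                   # ...independently for each dimension
        entries.append((weights * chunk).sum(axis=0))    # weighted merge -> one compressed entry
    return np.stack(entries)

# 16 tokens, 8-dim vectors; stride 3 < window 4, so adjacent windows share one token
# and information near a boundary lands in both neighbouring entries.
rng = np.random.default_rng(0)
kv = rng.normal(size=(16, 8))
compressed = compress_kv(kv, gate_w=rng.normal(size=8), pos_bias=rng.normal(size=4))
print(compressed.shape)   # (5, 8)
```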
DeepSeek V4 uses the token compressor at two different compression ratios, creating two distinct attention modes that serve different purposes in the model:
CSA compresses every 4 tokens into one KV entry — a 4× reduction in sequence length. Then it adds sparsity on top: instead of attending to all compressed entries, each query token uses a "lightning indexer" (a fast FP4-precision dot product with ReLU scoring) to select only the top 1,024 most relevant compressed KV entries to attend to. You get detailed, high-fidelity context — but only for the parts of the sequence that actually matter to the current query.
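As a sketch of the selection step, the snippet below scores every compressed entry with a cheap ReLU'd dot product, keeps the top 1,024, and runs ordinary softmax attention over just that subset. The real indexer operates in FP4; this toy version stays in standard floats, and the names, shapes, and sizes are illustrative assumptions rather than DeepSeek's implementation.

```python
import numpy as np

def sparse_attend(q, keys, values, top_k=1024):
    """q: (d,) query; keys/values: (n, d) compressed entries; returns a (d,) context vector."""
    # 1. Lightning-indexer-style scoring: one dot product per entry, then ReLU.
    scores = np.maximum(keys @ q, 0.0)                 # (n,)
    # 2. Keep only the top_k most relevant compressed entries.
    idx = np.argpartition(scores, -top_k)[-top_k:]
    # 3. Ordinary scaled softmax attention over just that subset.
    logits = (keys[idx] @ q) / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values[idx]

# A 1M-token context compressed 4x would give ~250K candidate entries; 100K is used
# here just to keep the demo light. Attention still touches only 1,024 of them.
rng = np.random.default_rng(0)
n, d = 100_000, 128
out = sparse_attend(rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d)))
print(out.shape)   # (128,)
```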
HCA compresses every 128 tokens into one KV entry — a 128× reduction. Then it applies dense attention (no sparse selection) over all the compressed entries. The resulting context window is tiny in terms of entries, but spans the entire sequence. HCA gives the model a high-level map of everything that's been said, at very low compute cost.
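Some quick arithmetic (illustrative, assuming the ratios above apply uniformly across the sequence) shows how small the per-query workload becomes at a 1-million-token context:

```python
context = 1_000_000

csa_entries  = context // 4      # 250,000 compressed entries available to CSA...
csa_attended = 1_024             # ...but each query attends to only the top 1,024 of them

hca_entries  = context // 128    # 7,812 compressed entries, attended densely by every query

print(csa_entries, csa_attended, hca_entries)   # 250000 1024 7812
```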
The real elegance is how V4 mixes these two modes across the transformer's layers. Different layers in a transformer play different roles: early layers build basic representations, middle layers refine and relate concepts, final layers produce precise outputs. V4's hybrid architecture matches the attention type to each layer's job, leaning on the cheap, global HCA view where a rough map of the whole context is enough and reserving the more expensive, more precise CSA view for the layers that need fine-grained detail.
This gradient from cheap-and-global to expensive-and-precise mirrors how a human expert reads a long document: skim for structure first, then re-read the sections that matter for the specific question, then formulate a careful answer with full attention on the relevant details.
Across the full 1-million-token context, the CSA+HCA hybrid delivers substantial efficiency gains versus DeepSeek V3.2, which already used the previous state-of-the-art Multi-head Latent Attention (MLA): at 1M tokens, V4-Pro needs roughly 27% of V3.2's inference compute (a ~73% reduction) and roughly 10% of its KV cache (a ~90% reduction).
The 90% KV cache reduction is relative to V3.2, which already used MLA (Multi-head Latent Attention) — itself a major improvement over standard MHA. V4's savings stack on top of an already highly optimized baseline. Compared to a naive standard MHA implementation, the real-world reduction is closer to 99.9% at 1M context.
The practical implication for local AI users is direct: more efficient attention means less VRAM consumed by the context window, which means more room for larger models, longer contexts, or both. Every advance in KV cache efficiency shifts the model-runs-locally threshold downward — models and context lengths that were previously impossible on consumer hardware become feasible.
The V4-Pro and V4-Flash flagship models themselves still require serious multi-GPU hardware to run locally (V4-Flash needs at minimum ~77GB at IQ2 quantization — roughly a Mac Studio M4 Ultra). But the architectural principles of CSA and HCA are not exclusive to DeepSeek. They will propagate into smaller, distilled, and fine-tuned models over the coming months — just as MLA from DeepSeek V2 influenced downstream architectures across the open-source ecosystem.
When distilled V4 variants at 7B and 14B land — likely by June 2026 — they will carry V4's attention efficiency at consumer-hardware sizes. A 14B model with CSA-style attention running at 32K or 64K context on an RTX 4070 will be qualitatively different from the current generation of 14B models hitting VRAM limits at 8K context.
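To put a rough number on that, here is an entirely hypothetical back-of-envelope estimate for a 14B-class model at 64K context. It assumes a Llama-style GQA layout (40 layers, 8 KV heads, 128-dim heads, FP16 cache) and assumes a 4× CSA-style token compressor shrinks the cache by the same factor; none of these figures describe a real released model.

```python
def kv_cache_gib(tokens, layers=40, kv_heads=8, head_dim=128, bytes_per_value=2, compression=1):
    """Rough KV cache size in GiB for an assumed 14B-class GQA configuration."""
    per_token = 2 * head_dim * bytes_per_value * kv_heads * layers   # bytes per uncompressed token
    return per_token * (tokens / compression) / 2**30

print(f"64K context, no compression: {kv_cache_gib(65_536):.1f} GiB")                 # 10.0 GiB
print(f"64K context, 4x compression: {kv_cache_gib(65_536, compression=4):.1f} GiB")  # 2.5 GiB
```

On a 12 GB card, the uncompressed figure overflows VRAM before the model weights are even loaded; the compressed figure is what would make 64K context plausible alongside quantized 14B weights.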
Understanding the architecture is one thing. Knowing whether a specific model — at a specific quantization and context length — will actually fit in your GPU is another. That's exactly what Runyard's VRAM Calculator is built for. Enter your GPU and target context length, pick a model from the catalog, and get an instant answer on whether it fits and at what quantization. When the V4 distills appear, you'll know immediately which variant to pull — before committing to a 10–20GB download.
Context length dramatically changes VRAM requirements. A 7B model at Q4_K_M uses ~4.7GB for weights — but at 32K context, the KV cache adds another 1–2GB. At 128K context, it adds 6–8GB. Models with efficient attention (like V4 distills) compress that KV cache, so they hit longer contexts on the same VRAM. The VRAM Calculator at www.runyard.dev/tools/vram-calculator accounts for both weights and context overhead.