On April 24, 2026, DeepSeek announced V4 — and the local AI community immediately split into two camps: people excited that an open-weight model now rivals GPT-5.5 on agentic benchmarks, and people who quietly checked the VRAM requirements and closed the tab. Both reactions are correct. DeepSeek V4 is genuinely remarkable AND genuinely out of reach for most home hardware. Here's an honest breakdown of what it is, what it isn't, and — more importantly — why it matters even if you can't run it today.
DeepSeek V4 comes in two variants, both open-weight under the MIT license. V4 Pro is the flagship: 1.6 trillion total parameters with 49 billion active per token. V4 Flash is the more accessible option: 284 billion total with 13 billion active per token. Both use a Mixture-of-Experts (MoE) architecture and both support a 1 million token context window.
MoE (Mixture-of-Experts) means the model has many specialized sub-networks but only routes each token through a small fraction of them. V4 Pro's "49B active" is the number that actually determines inference cost — not the headline 1.6T. This is how DeepSeek matches dense 70B+ model quality while being cheaper per token at scale.
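If the routing idea feels abstract, here's a minimal top-k MoE sketch in Python with numpy. The expert count, hidden size, and top-k value are made-up placeholders, not V4's real configuration, but the structure shows why per-token cost tracks the handful of experts that actually fire rather than the full parameter count.

# Minimal top-k MoE routing sketch. All sizes are placeholders,
# not DeepSeek V4's real configuration.
import numpy as np

HIDDEN = 1024       # hypothetical hidden size
NUM_EXPERTS = 64    # hypothetical experts per MoE layer
TOP_K = 4           # hypothetical experts activated per token

rng = np.random.default_rng(0)
experts = [(rng.standard_normal((HIDDEN, 4 * HIDDEN)) * 0.02,
            rng.standard_normal((4 * HIDDEN, HIDDEN)) * 0.02)
           for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.02

def moe_layer(x):
    scores = x @ router                  # score one token against every expert
    top = np.argsort(scores)[-TOP_K:]    # keep only the TOP_K best-scoring experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()
    out = np.zeros_like(x)
    for w, idx in zip(weights, top):     # the other NUM_EXPERTS - TOP_K experts never run,
        w_in, w_out = experts[idx]       # so "active" parameters set per-token compute
        out += w * (np.maximum(x @ w_in, 0) @ w_out)
    return out

print(moe_layer(rng.standard_normal(HIDDEN)).shape)  # (1024,)

Total parameters still matter for memory, though, because every expert's weights have to live somewhere. That tension is exactly what the hardware numbers below are about.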
Let's not bury the lead: running DeepSeek V4 locally is hard. This is not a "needs a good GPU" situation — it's a "needs a small cluster" situation. The numbers below are based on community benchmarks and inference guides published in the five days since launch.
V4 Flash at Q4 needs roughly 96GB of pooled GPU memory for comfortable inference: that's four RTX 3090s (24GB each) at minimum, or a single H200. V4 Pro at Q4 is around 400GB+, which puts it firmly in multi-node territory. Even at Q3, V4 Flash needs 72GB, so three RTX 4090s barely cover it.
vLLM is the recommended inference framework for DeepSeek V4 — it natively supports MoE expert parallelism and the hybrid attention architecture. Ollama does not yet have a V4 template; you'll need vLLM or SGLang for now.
# Install vLLM (requires CUDA 12.1+)
pip install vllm
# Serve DeepSeek V4 Flash (needs 96GB+ pooled VRAM at Q4)
# --tensor-parallel-size: set to your GPU count
# --max-model-len: reduce from 1M to save VRAM
# --quantization fp8: or drop this flag for BF16
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --quantization fp8 \
  --port 8000
# Test with the OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-V4-Flash",
"messages": [{"role": "user", "content": "Hello"}]
}'

Set --max-model-len to 16384 or 32768 if you're trying to run V4 Flash on under 96GB. Reducing context length is the single biggest lever for cutting VRAM usage on large MoE models.
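To see why, here's a back-of-the-envelope KV-cache estimate in Python. The layer count, KV-head count, and head dimension are placeholder guesses rather than V4's published architecture, and DeepSeek-style latent attention compresses the cache well below this naive figure, but the scaling with context length is the point.

# Back-of-the-envelope KV-cache sizing. Architecture numbers are
# placeholders, not V4's real specs; treat the absolute values as rough.
def kv_cache_gb(context_len, num_layers=60, num_kv_heads=8, head_dim=128,
                bytes_per_value=2, batch_size=1):
    # Keys and values (2x) are cached at every layer for every token in context.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_len * batch_size / 1e9

for ctx in (16_384, 32_768, 131_072, 1_000_000):
    print(f"{ctx:>9,} tokens -> ~{kv_cache_gb(ctx):.0f} GB KV cache per sequence")

With these placeholder numbers the cache runs a few gigabytes per sequence at 32K tokens and hundreds of gigabytes at the full 1M window.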
Here's the part of the DeepSeek V4 story that gets lost in the hardware hand-wringing: this release changes the entire local LLM ecosystem even for people who will never run V4 directly.
DeepSeek V3's release in late 2024 was followed within weeks by a wave of distilled models: smaller models trained to mimic its reasoning. The DeepSeek-R1 distills (1.5B, 7B, 8B, 14B, 32B, 70B) all trace back to that frontier model. V4 will produce the same wave. Distilled 8B and 14B versions will run on 8–12GB GPUs and inherit much of V4's reasoning quality.
DeepSeek V4 Pro posts scores alongside GPT-5.5 and Claude Opus 4.7 on MATH-500, GPQA Diamond, and SWE-bench Verified — as an open-weight model under MIT license. This matters for every smaller open model because it redefines what "frontier-level" means in the open-weights world.
Every time DeepSeek releases a frontier open-weight model, it forces Meta, Alibaba, Mistral, and Google to respond faster. Llama 4 Scout and Maverick shipped in April 2025 partly in response to the pressure from V3 and R1; Qwen 3.6 27B dropped on April 22, two days before V4's announcement. The local LLM user benefits from this arms race in the form of better consumer-runnable models, sooner.
If you have 8–48GB of VRAM, the models available today are excellent — and the V4 distills will land within weeks. Here's the practical rundown by hardware tier:
Not sure if a model fits your GPU? The Runyard VRAM Calculator at runyard.dev/tools/vram-calculator shows you the exact memory footprint for every quantization level — including KV cache headroom — before you download anything.
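If you want intuition for what a calculator like that is doing, the simplest version is quantized weight size plus a KV-cache allowance. This is a rough sketch, not the Runyard tool's actual logic, and MoE serving stacks that offload inactive experts to system RAM can come in under a weights-only figure.

# Simplest possible "does it fit" estimate: quantized weights + KV headroom.
# A rough sketch only; real calculators also model activation buffers and overhead.
def est_vram_gb(total_params_b, bits_per_weight, kv_headroom_gb=4.0):
    weights_gb = total_params_b * bits_per_weight / 8   # e.g. 8B params at 4.5 bits ~ 4.5 GB
    return weights_gb + kv_headroom_gb

for params_b, label in ((8, "8B distill"), (14, "14B distill"), (32, "32B model")):
    print(f"{label:>11} at ~4.5 bits/weight: ~{est_vram_gb(params_b, 4.5):.1f} GB")

Under these rough assumptions the upcoming 8B and 14B distills land around the 8–12GB mark, in line with the estimate above.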
On the benchmarks that matter for the local LLM community — coding, math, and agentic tasks — V4 Pro is competitive with the best closed models. V4 Flash tells the more interesting self-hosting story: it trades roughly 5 points on most benchmarks for a model that is 5–6x smaller in total parameters and needs roughly a quarter of the active compute per token.
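For readers who want those ratios spelled out, they come straight from the parameter counts quoted earlier:

# Ratios derived from the spec-sheet numbers above (1.6T/49B Pro, 284B/13B Flash).
pro_total, pro_active = 1_600, 49        # billions of parameters
flash_total, flash_active = 284, 13
print(f"total params:  {pro_total / flash_total:.1f}x smaller")                    # ~5.6x
print(f"active params: {pro_active / flash_active:.1f}x less compute per token")   # ~3.8x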
See which models fit your GPU right now — including every April 2026 release, ranked for your hardware.
Open the Runyard Model Radar →
Find AI models that fit your exact hardware. Enter your specs and get a ranked list instantly.