Something unusual happened in the last two weeks of April 2026. Four Chinese AI labs — Z.ai, MiniMax, Moonshot AI, and DeepSeek — each released open-weight models with frontier-level coding ability, all within a 12-day window. The timing was probably not coincidental. The capability levels were nearly identical. At least two of them can genuinely challenge Claude Opus 4.6 and GPT-5.4 on software engineering benchmarks — at a fraction of the API cost. But there's a catch the headlines keep glossing over: these are Mixture-of-Experts models, and "active parameters" is not the number you should care about when it comes to local hardware.
Here are the four models at a glance, ordered by total parameter count:

- Kimi K2.6 (Moonshot AI): 1T total / 32B active, Modified MIT license
- GLM-5.1 (Z.ai): 744B total / 40B active, MIT license
- MiniMax M2.7 (MiniMax): 456B total / 10B active, non-commercial license
- DeepSeek V4-Flash (DeepSeek): 284B total / 13B active, MIT license
On coding benchmarks, Kimi K2.6 leads at 58.6% on SWE-Bench Pro, with GLM-5.1 scoring in the same tier, while DeepSeek V4-Flash scores 57.2% and MiniMax M2.7 reaches 56.2%. All four comfortably outperform Claude Opus 4.6's 53.4%, and at API pricing they run at roughly a third of Western frontier cost. That's the headline. But if you want to run them locally, the story gets more complicated.
Mixture-of-Experts (MoE) models route each token through only a small fraction of the total network — the "active" experts. Kimi K2.6 has 1 trillion total parameters but activates only 32 billion per forward pass. That means inference is fast and GPU compute is comparable to a dense 32B model. However, you still need to hold all 1 trillion parameters in memory. You cannot load only the active experts; the router dispatches to any of the 384 expert clusters on any token, so all weights must remain addressable at all times.
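To make the routing concrete, here is a minimal sketch in Python. The expert count, dimensions, and plain softmax gate are toy values for illustration, not K2.6's actual architecture, but the memory consequence is the same: every expert matrix stays allocated even though only a couple run per token.

import numpy as np

# Toy MoE layer with top-k routing over a small set of experts.
# Sizes and the simple softmax gate are illustrative, not any real model's config.
rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 16

experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]  # ALL experts stay resident
router = rng.standard_normal((d_model, n_experts))

def moe_layer(token_vec):
    logits = token_vec @ router
    chosen = np.argsort(logits)[-top_k:]          # only top_k experts compute for this token
    gate = np.exp(logits[chosen])
    gate /= gate.sum()
    # Compute cost scales with top_k; memory cost scales with n_experts,
    # because the next token can route to any of them.
    return sum(w * (token_vec @ experts[i]) for w, i in zip(gate, chosen))

out = moe_layer(rng.standard_normal(d_model))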
Q2 quantization is the most compressed useful format — roughly 0.25 bytes per parameter. Even at Q2, DeepSeek V4-Flash needs ~72GB just for weights before context overhead. A single RTX 4090 has 24GB. You need three of them, plus CPU offloading, to load V4-Flash at Q2. This is a home-lab-for-serious-builders situation, not a casual setup.
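The arithmetic behind those numbers is worth writing out. A quick sketch, assuming the rough 0.25 bytes per parameter for Q2 and ignoring KV-cache and runtime overhead:

# Weight footprint at Q2 (~0.25 bytes/parameter), ignoring context/KV-cache overhead.
Q2_BYTES_PER_PARAM = 0.25

total_params_b = {                 # total parameters in billions (not active)
    "Kimi K2.6": 1000,
    "GLM-5.1": 744,
    "MiniMax M2.7": 456,
    "DeepSeek V4-Flash": 284,
}

for name, params in total_params_b.items():
    print(f"{name}: ~{params * Q2_BYTES_PER_PARAM:.0f} GB of weights at Q2")
# DeepSeek V4-Flash lands at ~71 GB -- roughly three RTX 4090s' worth of VRAM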
DeepSeek V4-Flash is the smaller sibling to V4-Pro (1.6T / 49B active). At 284B total with 13B active parameters, it is the most approachable of these four for local inference. It ships under MIT license with a 1M-token context window and is designed for fast reasoning workflows. Community-quantized GGUFs have started appearing on Hugging Face, and llama.cpp can run them with aggressive CPU offloading if you have at least 64–96GB of combined VRAM and system RAM.
GLM-5.1 was the first of the wave to land, released April 7 by Z.ai (Zhipu AI's commercial arm). With 744B total parameters and 40B active, it uses DeepSeek Sparse Attention (DSA) to cut long-context compute overhead significantly. It ships under a clean MIT license that covers commercial use, fine-tuning, and redistribution without restriction. Unsloth has published dynamic GGUF quantizations that bring the Q2 download to around 185GB; the dynamic format preserves quality in the most sensitive weight layers rather than applying uniform compression to every layer.
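To see why a dynamic quant still averages out near Q2 overall, here is a toy mixed-precision calculation. The layer-group split and bit-widths below are illustrative assumptions, not Unsloth's actual recipe:

# Toy mixed-precision average for GLM-5.1 (744B total parameters).
# The group fractions and bit-widths are assumptions for illustration only.
total_params_b = 744

groups = {
    # name: (fraction of parameters, bits per weight)
    "sensitive layers (attention, routers, embeddings)": (0.06, 6.0),
    "bulk expert FFN weights":                           (0.94, 1.75),
}

avg_bits = sum(frac * bits for frac, bits in groups.values())
size_gb = total_params_b * avg_bits / 8
print(f"average {avg_bits:.2f} bits/param -> ~{size_gb:.0f} GB download")
# ~2.0 bits/param on average, i.e. roughly the 185GB figure above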
Moonshot AI's Kimi K2.6 is the most capable model in this cohort on pure coding benchmarks and adds native multimodality the others lack. Its 384-expert architecture with MLA attention and 256K context makes it powerful for complex agent pipelines. The Modified MIT license is close to permissive but includes derivative-naming clauses worth reviewing before shipping a product. For local inference this is the hardest of the four: 1 trillion total parameters means Q2 quantization still lands around 250GB, requiring serious multi-GPU hardware or Apple Silicon at maximum memory configuration.
MiniMax M2.7 stands out for raw efficiency: 456B total parameters with only 10B active, scoring 56.2% on SWE-Bench Pro. Its hybrid sliding window attention reduces KV-cache size by nearly 6× compared to full attention, making it fast once loaded. The problem for self-hosters building real products: M2.7 is non-commercial only. Production use requires written authorization from MiniMax. This significantly limits its appeal unless you are purely researching or building proof-of-concepts with no revenue attached.
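To get a feel for where that KV-cache saving comes from, here is a rough comparison. The layer counts, head dimensions, and window size are assumptions for the sketch, not M2.7's published configuration; with these toy numbers the reduction comes out around 5x, and the exact factor depends on how many layers keep full attention and on context length.

# KV-cache size: full attention on every layer vs. a hybrid where most layers
# only cache a sliding window. All hyperparameters here are illustrative.
def kv_cache_gb(layers, kv_heads, head_dim, tokens_cached, bytes_per_val=2):
    return layers * kv_heads * head_dim * 2 * tokens_cached * bytes_per_val / 1e9  # 2x for K and V

ctx = 128_000       # prompt length
window = 4_096      # sliding-window span

full_attn = kv_cache_gb(layers=60, kv_heads=8, head_dim=128, tokens_cached=ctx)
# hybrid: assume 1 in 6 layers keeps full attention, the rest cache only the window
hybrid = kv_cache_gb(10, 8, 128, ctx) + kv_cache_gb(50, 8, 128, window)

print(f"full: {full_attn:.1f} GB   hybrid: {hybrid:.1f} GB   ratio: ~{full_attn / hybrid:.1f}x")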
All four models cluster within 2.4 percentage points on SWE-Bench Pro while outperforming Claude Opus 4.6 by 2.8–5.2 points. GLM-5.1 scores in the same tier as Kimi K2.6 and has consistently outperformed it on first-pass code quality in head-to-head IDE tests. The practical takeaway: for most coding tasks, the model you can actually run on your hardware, or afford at API rates, matters more than the benchmark spread between them.
The most accessible path to running one of these models locally is DeepSeek V4-Flash with community GGUF quantizations via llama.cpp. By splitting layers across GPUs and offloading the remainder to CPU, you can spread the weights across VRAM and system RAM. A machine with 2× RTX 3090 (48GB VRAM) plus 128GB DDR5 system RAM is a viable, though slow, setup for Q2 inference, yielding roughly 1–3 tokens per second.
# Download community Q2 GGUF (~72GB) and run with llama.cpp
# Adjust --n-gpu-layers based on your available VRAM
llama-cli \
-m deepseek-v4-flash-q2_k.gguf \
--n-gpu-layers 20 \
--ctx-size 8192 \
--threads 16 \
-p "Write a Python web scraper for Hacker News top stories"
# Or via Ollama once the community model is indexed
ollama run deepseek-v4-flash:q2_k_m

Apple's unified memory architecture is the sleeper advantage here. A Mac Studio with M3 Ultra at 192GB unified memory can load DeepSeek V4-Flash at Q3 or Q4 entirely in unified memory, avoiding the speed penalty of splitting between VRAM and system RAM on discrete GPUs. Expect 3–6 tokens per second at these model sizes: not fast, but clean and power-efficient. For GLM-5.1 at Q2 (~186GB), a maxed M3 Ultra (192GB) is extremely tight; a dual-node setup or an M4 Ultra configuration is more practical.
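If you want to sanity-check a specific configuration before buying anything, the budget math is the same whether the pool is unified memory or VRAM plus system RAM. A rough fit check, where the 10% runtime overhead and the per-token KV-cache figure are assumptions rather than measured values:

# Rough single-pool fit check (e.g. Apple unified memory).
# The 10% overhead factor and KV bytes/token are assumptions, not measurements.
def fits(total_params_b, bytes_per_param, ctx_tokens, kv_bytes_per_token, mem_gb):
    weights_gb = total_params_b * bytes_per_param
    kv_gb = ctx_tokens * kv_bytes_per_token / 1e9
    needed_gb = (weights_gb + kv_gb) * 1.10
    return needed_gb, needed_gb <= mem_gb

# DeepSeek V4-Flash (284B) at Q4 (~0.5 bytes/param), 32K context, on 192GB unified memory
needed, ok = fits(284, 0.5, 32_000, kv_bytes_per_token=500_000, mem_gb=192)
print(f"needs ~{needed:.0f} GB -> {'fits' if ok else 'does not fit'}")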
Before you ship a product built on one of these models, verify the license carefully:

- GLM-5.1: MIT. Commercial use, fine-tuning, and redistribution are all permitted without restriction.
- DeepSeek V4-Flash: MIT. Same clean terms.
- Kimi K2.6: Modified MIT. Close to permissive, but review the derivative-naming clauses before shipping a product.
- MiniMax M2.7: Non-commercial. Production use requires written authorization from MiniMax.
For production deployments, GLM-5.1 and DeepSeek V4-Flash are the cleanest choices: standard MIT with no ambiguity around commercial use, fine-tuning, or distributing derivative products. Always check the license file on Hugging Face directly rather than relying on the model card, which can be edited independently of the license file itself.
The Chinese open-source model wave of April 2026 is genuinely impressive — frontier-quality coding models under permissive licenses, released at a pace and cost that Western closed-source labs are not currently matching. But "32B active parameters" does not mean "a 32B model" for your hardware. You are loading hundreds of gigabytes of weights regardless of how many are active on any given token.
For true local inference on consumer hardware, DeepSeek V4-Flash is the only model in this group within realistic reach — and even that requires either a multi-GPU workstation or a Mac with maximum unified memory. If you are running a solo RTX 3090 or RTX 4090, dense models in the 7B–34B range will give you better practical performance today without the complexity of splitting layers across multiple cards and system RAM.
Not sure which models your hardware can actually run? Runyard's VRAM Calculator handles MoE models, quantization levels, and CPU offloading scenarios — enter your GPU and system RAM to see your real options.