Something unusual happened in the last two weeks of April 2026. Four Chinese AI labs — Z.ai, MiniMax, Moonshot AI, and DeepSeek — each released open-weight models with frontier-level coding ability, all within a 12-day window. The timing was probably not coincidental. The capability levels were nearly identical. At least two of them can genuinely challenge Claude Opus 4.6 and GPT-5.4 on software engineering benchmarks — at a fraction of the API cost. But there's a catch the headlines keep glossing over: these are Mixture-of-Experts models, and "active parameters" is not the number you should care about when it comes to local hardware.
Here are the four models at a glance, ordered by total parameter count:

- Kimi K2.6 (Moonshot AI): 1T total / 32B active, Modified MIT license
- GLM-5.1 (Z.ai): 744B total / 40B active, MIT license
- MiniMax M2.7 (MiniMax): 456B total / 10B active, non-commercial license
- DeepSeek V4-Flash (DeepSeek): 284B total / 13B active, MIT license
On coding benchmarks, Kimi K2.6 leads at 58.6% on SWE-Bench Pro, with GLM-5.1 scoring in the same tier, while DeepSeek V4-Flash scores 57.2% and MiniMax M2.7 reaches 56.2%. All four comfortably outperform Claude Opus 4.6's 53.4%, and at API pricing they run at roughly a third of Western frontier cost. That's the headline. But if you want to run them locally, the story gets more complicated.
Mixture-of-Experts (MoE) models route each token through only a small fraction of the total network — the "active" experts. Kimi K2.6 has 1 trillion total parameters but activates only 32 billion per forward pass. That means inference is fast and GPU compute is comparable to a dense 32B model. However, you still need to hold all 1 trillion parameters in memory. You cannot load only the active experts; the router dispatches to any of the 384 expert clusters on any token, so all weights must remain addressable at all times.
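To make the routing concrete, here is a minimal sketch in Python. The expert count, dimensions, and plain softmax gate are toy values for illustration, not K2.6's actual architecture, but the memory consequence is the same: every expert matrix stays allocated even though only a couple run per token.

import numpy as np

# Toy MoE layer with top-k routing over a small set of experts.
# Sizes and the simple softmax gate are illustrative, not any real model's config.
rng = np.random.default_rng(0)
n_experts, top_k, d_model = 8, 2, 16

experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]  # ALL experts stay resident
router = rng.standard_normal((d_model, n_experts))

def moe_layer(token_vec):
    logits = token_vec @ router
    chosen = np.argsort(logits)[-top_k:]          # only top_k experts compute for this token
    gate = np.exp(logits[chosen])
    gate /= gate.sum()
    # Compute cost scales with top_k; memory cost scales with n_experts,
    # because the next token can route to any of them.
    return sum(w * (token_vec @ experts[i]) for w, i in zip(gate, chosen))

out = moe_layer(rng.standard_normal(d_model))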
Q2 quantization is the most compressed useful format — roughly 0.25 bytes per parameter. Even at Q2, DeepSeek V4-Flash needs ~72GB just for weights before context overhead. A single RTX 4090 has 24GB. You need three of them, plus CPU offloading, to load V4-Flash at Q2. This is a home-lab-for-serious-builders situation, not a casual setup.
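The arithmetic behind those numbers is worth writing out. A quick sketch, assuming the rough 0.25 bytes per parameter for Q2 and ignoring KV-cache and runtime overhead:

# Weight footprint at Q2 (~0.25 bytes/parameter), ignoring context/KV-cache overhead.
Q2_BYTES_PER_PARAM = 0.25

total_params_b = {                 # total parameters in billions (not active)
    "Kimi K2.6": 1000,
    "GLM-5.1": 744,
    "MiniMax M2.7": 456,
    "DeepSeek V4-Flash": 284,
}

for name, params in total_params_b.items():
    print(f"{name}: ~{params * Q2_BYTES_PER_PARAM:.0f} GB of weights at Q2")
# DeepSeek V4-Flash lands at ~71 GB -- roughly three RTX 4090s' worth of VRAM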
DeepSeek V4-Flash is the smaller sibling to V4-Pro (1.6T / 49B active). At 284B total with 13B active parameters, it is the most approachable of these four for local inference. It ships under MIT license with a 1M-token context window and is designed for fast reasoning workflows. Community-quantized GGUFs have started appearing on Hugging Face, and llama.cpp can run them with aggressive CPU offloading if you have at least 64–96GB of combined VRAM and system RAM.
GLM-5.1 was the first of the wave to land, released April 7 by Z.ai (Zhipu AI's commercial arm). With 744B total parameters and 40B active, it uses DeepSeek Sparse Attention (DSA) to cut long-context compute overhead significantly. It ships under a clean MIT license that covers commercial use, fine-tuning, and redistribution without restriction. Unsloth has published dynamic GGUF quantizations that bring the Q2 download to around 185GB; the dynamic format preserves quality in the most sensitive weight layers rather than applying uniform compression to every layer.
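To see why a dynamic quant still averages out near Q2 overall, here is a toy mixed-precision calculation. The layer-group split and bit-widths below are illustrative assumptions, not Unsloth's actual recipe:

# Toy mixed-precision average for GLM-5.1 (744B total parameters).
# The group fractions and bit-widths are assumptions for illustration only.
total_params_b = 744

groups = {
    # name: (fraction of parameters, bits per weight)
    "sensitive layers (attention, routers, embeddings)": (0.06, 6.0),
    "bulk expert FFN weights":                           (0.94, 1.75),
}

avg_bits = sum(frac * bits for frac, bits in groups.values())
size_gb = total_params_b * avg_bits / 8
print(f"average {avg_bits:.2f} bits/param -> ~{size_gb:.0f} GB download")
# ~2.0 bits/param on average, i.e. roughly the 185GB figure above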
Moonshot AI's Kimi K2.6 is the most capable model in this cohort on pure coding benchmarks and adds native multimodality the others lack. Its 384-expert architecture with MLA attention and 256K context makes it powerful for complex agent pipelines. The Modified MIT license is close to permissive but includes derivative-naming clauses worth reviewing before shipping a product. For local inference this is the hardest of the four: 1 trillion total parameters means Q2 quantization still lands around 250GB, requiring serious multi-GPU hardware or Apple Silicon at maximum memory configuration.
MiniMax M2.7 stands out for raw efficiency: 456B total parameters with only 10B active, scoring 56.2% on SWE-Bench Pro. Its hybrid sliding window attention reduces KV-cache size by nearly 6× compared to full attention, making it fast once loaded. The problem for self-hosters building real products: M2.7 is non-commercial only. Production use requires written authorization from MiniMax. This significantly limits its appeal unless you are purely researching or building proof-of-concepts with no revenue attached.
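To get a feel for where that KV-cache saving comes from, here is a rough comparison. The layer counts, head dimensions, and window size are assumptions for the sketch, not M2.7's published configuration; with these toy numbers the reduction comes out around 5x, and the exact factor depends on how many layers keep full attention and on context length.

# KV-cache size: full attention on every layer vs. a hybrid where most layers
# only cache a sliding window. All hyperparameters here are illustrative.
def kv_cache_gb(layers, kv_heads, head_dim, tokens_cached, bytes_per_val=2):
    return layers * kv_heads * head_dim * 2 * tokens_cached * bytes_per_val / 1e9  # 2x for K and V

ctx = 128_000       # prompt length
window = 4_096      # sliding-window span

full_attn = kv_cache_gb(layers=60, kv_heads=8, head_dim=128, tokens_cached=ctx)
# hybrid: assume 1 in 6 layers keeps full attention, the rest cache only the window
hybrid = kv_cache_gb(10, 8, 128, ctx) + kv_cache_gb(50, 8, 128, window)

print(f"full: {full_attn:.1f} GB   hybrid: {hybrid:.1f} GB   ratio: ~{full_attn / hybrid:.1f}x")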
All four models cluster within 2.4 percentage points on SWE-Bench Pro while outperforming Claude Opus 4.6 by 2.8–5.2 points. GLM-5.1 scores in the same tier as Kimi K2.6 and has consistently outperformed it on first-pass code quality in head-to-head IDE tests. The practical takeaway: for most coding tasks, the model you can actually run on your hardware, or afford at API rates, matters more than the benchmark spread between them.
The most accessible path to running one of these models locally is DeepSeek V4-Flash with community GGUF quantizations via llama.cpp. By splitting layers across GPUs and offloading the remainder to CPU, you can spread the weights across VRAM and system RAM. A machine with 2× RTX 3090 (48GB VRAM) plus 128GB DDR5 system RAM is a viable, though slow, setup for Q2 inference, yielding roughly 1–3 tokens per second.
# Download community Q2 GGUF (~72GB) and run with llama.cpp
# Adjust --n-gpu-layers based on your available VRAM
llama-cli \
-m deepseek-v4-flash-q2_k.gguf \
--n-gpu-layers 20 \
--ctx-size 8192 \
--threads 16 \
-p "Write a Python web scraper for Hacker News top stories"
# Or via Ollama once the community model is indexed
ollama run deepseek-v4-flash:q2_k_m

Apple's unified memory architecture is the sleeper advantage here. A Mac Studio with M3 Ultra at 192GB unified memory can load DeepSeek V4-Flash at Q3 or Q4 entirely in unified memory, avoiding the speed penalty of splitting between VRAM and system RAM on discrete GPUs. Expect 3–6 tokens per second at these model sizes: not fast, but clean and power-efficient. For GLM-5.1 at Q2 (~186GB), a maxed M3 Ultra (192GB) is extremely tight; a dual-node setup or an M4 Ultra configuration is more practical.
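If you want to sanity-check a specific configuration before buying anything, the budget math is the same whether the pool is unified memory or VRAM plus system RAM. A rough fit check, where the 10% runtime overhead and the per-token KV-cache figure are assumptions rather than measured values:

# Rough single-pool fit check (e.g. Apple unified memory).
# The 10% overhead factor and KV bytes/token are assumptions, not measurements.
def fits(total_params_b, bytes_per_param, ctx_tokens, kv_bytes_per_token, mem_gb):
    weights_gb = total_params_b * bytes_per_param
    kv_gb = ctx_tokens * kv_bytes_per_token / 1e9
    needed_gb = (weights_gb + kv_gb) * 1.10
    return needed_gb, needed_gb <= mem_gb

# DeepSeek V4-Flash (284B) at Q4 (~0.5 bytes/param), 32K context, on 192GB unified memory
needed, ok = fits(284, 0.5, 32_000, kv_bytes_per_token=500_000, mem_gb=192)
print(f"needs ~{needed:.0f} GB -> {'fits' if ok else 'does not fit'}")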
Before you ship a product built on one of these models, verify the license carefully:

- GLM-5.1: MIT. Commercial use, fine-tuning, and redistribution are all permitted without restriction.
- DeepSeek V4-Flash: MIT. Same clean terms.
- Kimi K2.6: Modified MIT. Close to permissive, but review the derivative-naming clauses before shipping a product.
- MiniMax M2.7: Non-commercial. Production use requires written authorization from MiniMax.
For production deployments, GLM-5.1 and DeepSeek V4-Flash are the cleanest choices: standard MIT with no ambiguity around commercial use, fine-tuning, or distributing derivative products. Always check the license file on Hugging Face directly rather than relying on the model card, which can be edited independently of the license file itself.
The Chinese open-source model wave of April 2026 is genuinely impressive — frontier-quality coding models under permissive licenses, released at a pace and cost that Western closed-source labs are not currently matching. But "32B active parameters" does not mean "a 32B model" for your hardware. You are loading hundreds of gigabytes of weights regardless of how many are active on any given token.
For true local inference on consumer hardware, DeepSeek V4-Flash is the only model in this group within realistic reach — and even that requires either a multi-GPU workstation or a Mac with maximum unified memory. If you are running a solo RTX 3090 or RTX 4090, dense models in the 7B–34B range will give you better practical performance today without the complexity of splitting layers across multiple cards and system RAM.
Not sure which models your hardware can actually run? Runyard's VRAM Calculator handles MoE models, quantization levels, and CPU offloading scenarios — enter your GPU and system RAM to see your real options.