Runyard.dev — Find AI Models That Run on Your Hardware

Xiaomi MiMo-V2.5: Open-Source Agentic AI With a 1-Million-Token Context Window

MiMo-V2.5 is Xiaomi's open-source answer to agentic AI — a model built to take actions, not just answer questions.

On April 22, 2026, Xiaomi's AI research team released MiMo-V2.5 — a fully open-source multimodal model with 310 billion total parameters and only 15 billion active per token. Built on the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, MiMo-V2.5 targets a specific category of task that most models fail at: long-horizon agentic work. Running multi-step coding agents, navigating real software repositories, reasoning across a million-token context window — these are the problems MiMo-V2.5 was engineered to solve. Here's what it actually is, how it stacks up, and what you realistically need to run it.

What Is MiMo-V2.5?

MiMo is Xiaomi's large model research line — and V2.5 is the third public release in the family, building on the V2 generation from late 2025. Unlike many so-called open releases that publish weights under restrictive licenses or withhold model cards, MiMo-V2.5 ships with weights, tokenizer, and full documentation available on Hugging Face at XiaomiMiMo/MiMo-V2.5. Nothing is held back.

  • Total parameters: 310 billion — Sparse Mixture-of-Experts architecture
  • Active parameters per token: ~15 billion — the real per-forward-pass compute cost
  • Context window: 32K native, progressively extendable to 256K and 1 million tokens
  • Modalities: text, image, video, and audio — fully unified single-model architecture
  • Backbone: MiMo-V2-Flash with a SigLIP2 NaFlex vision encoder added for multimodal input
  • License: permissive open-source — weights publicly available on Hugging Face
  • Public beta launched: April 22, 2026

Why 15B Active Parameters Is the Number That Actually Matters

Mixture-of-Experts is the architectural trick that lets MiMo-V2.5 store 310 billion parameters' worth of knowledge while activating only ~15 billion per token during inference. Think of it like a library with a smart routing system: all 310B parameters sit on the shelves, organized into expert sub-networks, but a learned router selects only the most relevant specialists for each token. Inference cost stays close to a dense 15B model, while knowledge capacity resembles something far larger.

This is why MoE models consistently punch above their active parameter count on benchmarks. The routing mechanism allows the model to accumulate narrow specializations across hundreds of expert sub-networks, then selectively apply only the relevant ones per token. A dense 15B model cannot do this — it has to generalize across all domains with the same fixed weights.

MoE trade-off in plain English: a dense 15B model holds 15B parameters worth of knowledge — everything is always active. A 310B MoE with 15B active holds roughly 20× more knowledge while keeping per-token compute comparable. The cost you pay is storage: all 310B parameters must sit in RAM or VRAM even though only 15B do work on any given token. That storage requirement is where consumer hardware hits its wall.
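The routing step described above can be sketched in a few lines. This is an illustrative toy (a softmax router with top-2 selection over 8 hypothetical experts), not Xiaomi's actual gating code:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, k=2):
    """Select the top-k experts for one token and renormalize their gate weights."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's router logits over 8 hypothetical experts; only k=2 actually run.
chosen = route_token([0.1, 2.3, -1.0, 0.7, 1.9, -0.2, 0.0, 0.4], k=2)
print(chosen)  # two (expert_index, gate_weight) pairs; the other 6 experts cost nothing
```

The key property is visible in the output: every token pays for only k experts' worth of compute, regardless of how many experts exist in total.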

What MiMo-V2.5 Is Actually Built For

Xiaomi's focus for MiMo-V2.5 is agentic workflows — specifically the kind of long-horizon, multi-step task completion that separates genuinely useful AI agents from impressive demos. The benchmark category they emphasize is CLAW (Coding with Long-context Agentic Workloads): tasks that require navigating real codebases, writing and running code, reading test output, iterating based on failures, and completing a multi-step goal without human checkpoints. VentureBeat noted that MiMo-V2.5 and its Pro sibling are among the most efficient models tested on these agentic task categories.

  • Agentic coding — Multi-step software engineering across full repositories, not just isolated function generation
  • Long-horizon reasoning — Complex problems requiring many chained reasoning steps across very large context windows
  • Multimodal understanding — Process code screenshots, architecture diagrams, video tutorials, and audio instructions in a single model pass
  • Document analysis — Deep comprehension at 256K–1M token context; read an entire technical specification or large codebase in one window
  • Real-world task completion — Navigate filesystems, call tools, execute shell commands, interpret output, recover from errors gracefully
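The task pattern in that last bullet (act, observe, recover, repeat) is, at its core, a loop around the model. A minimal sketch of such an agent loop, with a stubbed-in model and tool since the real harness is model- and framework-specific:

```python
def run_agent(model_call, tools, goal, max_steps=10):
    """Toy long-horizon agent loop: the model picks an action, we execute it,
    feed the observation back, and repeat until it declares the goal done."""
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        action = model_call(history)  # in practice: a chat-completion request
        if action["type"] == "finish":
            return action["result"]
        observation = tools[action["tool"]](action["args"])  # e.g. run the test suite
        history.append({"role": "tool", "content": observation})
    return None  # step budget exhausted without finishing

# Stub demo: the "model" runs the tests once, then reports success.
def fake_model(history):
    if len(history) == 1:
        return {"type": "tool", "tool": "run_tests", "args": "test_app.py"}
    return {"type": "finish", "result": f"fixed after {len(history) - 1} step(s)"}

print(run_agent(fake_model, {"run_tests": lambda args: "2 passed"}, "Fix the bug"))
```

What a benchmark like the agentic coding category measures is how many real-world iterations of this loop a model can sustain before losing the thread.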

MiMo-V2.5 VRAM Required by Quantization Level

  • FP16 (full precision) — ~620GB
  • Q8 quantization — ~310GB
  • Q4_K_M — ~155GB
  • IQ2 (extreme quant) — ~77GB

The Honest Hardware Reality

Running MiMo-V2.5 locally on consumer hardware is not straightforward. While 15B active parameters sounds manageable, the full 310B weight file must still be resident in memory — and that requirement is large at every quantization level. This is the central tension of large MoE models: fast inference, massive storage footprint.

  • FP16 — ~620GB: requires 8× H100 80GB or equivalent server cluster. Pure research territory.
  • Q8 — ~310GB: 4× A100 80GB minimum. Enterprise infrastructure.
  • Q4_K_M — ~155GB: theoretically fits on a Mac Pro with M4 Ultra (192GB unified memory) or a high-RAM multi-GPU workstation with pooled VRAM.
  • IQ2 — ~77GB: within reach of a single Mac Studio M4 Ultra (192GB) at roughly 5–10 tok/s.
  • CPU offload via llama.cpp — possible at IQ2 with 256GB+ system RAM once community GGUFs appear. Expect 2–4 tok/s.
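All of the figures above come from the same back-of-envelope rule: parameter count × bits per weight ÷ 8. A quick estimator (decimal GB; weights only, ignoring KV cache and activations, and glossing over the fact that Q4_K_M actually averages slightly more than 4 bits per weight):

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage only: params × bits ÷ 8, in decimal GB."""
    return total_params * bits_per_weight / 8 / 1e9

# 310B total parameters at each quantization level discussed above.
for label, bpw in [("FP16", 16), ("Q8", 8), ("Q4_K_M", 4), ("IQ2", 2)]:
    print(f"{label:7s} ~{weight_memory_gb(310e9, bpw):.0f} GB")
```

Run it against your own hardware budget: the number that matters is whether the result fits in your pooled VRAM plus whatever system RAM you can tolerate offloading to.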

The recommended inference framework for MiMo-V2.5 is SGLang — a Python engine designed for structured generation and high-throughput MoE model serving. It supports tensor parallelism across multiple GPUs, distributed attention, and the quantization options you need to get MiMo-V2.5 running at acceptable speeds on server hardware.

# Install SGLang for MiMo inference
pip install "sglang[all]"

# Launch MiMo-V2.5 across 4x H100 GPUs at int8 precision
python -m sglang.launch_server \
  --model-path XiaomiMiMo/MiMo-V2.5 \
  --tp-size 4 \
  --quantization int8 \
  --context-length 32768 \
  --port 8080

# Test a generation request against the running server
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"MiMo-V2.5","messages":[{"role":"user","content":"Debug this function and write tests."}]}'

# For consumer hardware — test the smaller MiMo-V2-Flash (12-16GB VRAM):
# github.com/XiaomiMiMo/MiMo-V2-Flash

Community GGUF quants for new models typically appear on Hugging Face within 1–2 weeks of launch. Search for XiaomiMiMo GGUFs and watch the llama.cpp GitHub discussions. Once available, IQ2 quants can be loaded with llama-server for testing — even on CPU-offload setups with 256GB+ system RAM, at 2–4 tok/s but fully functional for validation.
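Once the SGLang server from the block above is running, you can hit the same OpenAI-compatible endpoint from Python instead of curl. A minimal stdlib-only sketch (the port and path match the launch command; the actual network call is left commented out since it needs a live server):

```python
import json
# from urllib.request import Request, urlopen  # uncomment to actually send

def chat_request(model: str, prompt: str, max_tokens: int = 512) -> bytes:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body).encode("utf-8")

payload = chat_request("MiMo-V2.5", "Debug this function and write tests.")
# req = Request("http://localhost:8080/v1/chat/completions", data=payload,
#               headers={"Content-Type": "application/json"})
# print(urlopen(req).read().decode())
```

Because the endpoint is OpenAI-compatible, any existing client library that speaks that protocol should also work by pointing its base URL at the local server.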

Realistic Options for Consumer Hardware Right Now

The honest breakdown of who can run MiMo-V2.5 locally today is similar to the DeepSeek V4 situation from last week — a short list, but not empty.

  1. Mac Studio or Mac Pro with M4 Ultra (192GB unified memory) — IQ2/IQ3 fits at ~5–10 tok/s. The most practical single-machine consumer option today.
  2. Multi-GPU NVIDIA workstations (4× RTX 4090 = 96GB pooled VRAM) — Aggressive IQ2 quant only. Experimental; not yet officially benchmarked at this topology.
  3. Server-grade hardware (A100/H100 cluster) — The intended deployment target. Q8 precision at full throughput.
  4. Cloud GPU rentals (8× H100 on Lambda, CoreWeave, or RunPod) — Most practical for one-off experiments at $5–15/hr.
  5. High-RAM workstation (512GB system RAM) + CPU offload — ~2–4 tok/s at IQ2, once community GGUFs land. Usable for batch tasks, not real-time chat.

The Smaller MiMo Models Worth Watching Right Now

If you cannot run the full V2.5 today, MiMo-V2-Flash is the model to try. It's the compact backbone that MiMo-V2.5 was built on — published by Xiaomi on GitHub as a standalone model designed explicitly for single-GPU deployment on 12–16GB VRAM. Community testers report strong reasoning capability at a size that fits in consumer setups. Find it at github.com/XiaomiMiMo/MiMo-V2-Flash.

  • MiMo-V2-Flash — Compact, single-GPU-friendly (12–16GB VRAM). The reasoning backbone MiMo-V2.5 extends. Good coding and instruction following at accessible size.
  • MiMo-V2.5-Base — Full 310B base weights for fine-tuning and research. Available at XiaomiMiMo/MiMo-V2.5-Base on Hugging Face.
  • Community distills — Expect 7B and 14B distilled variants in the coming weeks. Watch Hugging Face for XiaomiMiMo community uploads.

Why This Release Matters Even If You Can't Run It Yet

The pattern is now well-established: a frontier-quality model drops as open weights. Immediately it's out of reach for consumer hardware. Six to eight weeks later, distilled variants at 7B, 14B, and 32B appear — retaining a meaningful fraction of the parent model's reasoning style at sizes that fit on a gaming GPU. And because the parent model is public, distill quality tends to be significantly better than distilling from a closed API model.

MiMo-V2.5 is also notable for something the current consumer-friendly local models don't offer: a truly unified multimodal architecture handling text, images, video, and audio in one model. Most local AI setups in 2026 still require separate models for vision and text. A 14B distill that retains multimodal capability would create a new category of local model — a single Ollama pull that handles every input type with no separate vision pipeline required.

The 1-million-token context window is the other landmark worth tracking. Even if current consumer hardware tops out at 32K context before VRAM runs out, the architectural capability for million-token reasoning is already baked in. As KV cache compression techniques like TurboQuant mature in llama.cpp, a distilled MiMo-V2.5 running at 128K context on 24GB VRAM becomes plausible by Q3 2026.

Track the MiMo-V2.5 release at huggingface.co/XiaomiMiMo and watch for new model IDs in the coming weeks. When community distills appear, use Runyard's VRAM Calculator to check hardware fit before committing to a multi-gigabyte download.

Find which coding and agentic models fit your GPU right now — and be ready when the MiMo distills land.

Compare models on your hardware →
