On April 22, 2026, Xiaomi's AI research team released MiMo-V2.5 — a fully open-source multimodal model with 310 billion total parameters and only 15 billion active per token. Built on the MiMo-V2-Flash backbone and extended with dedicated vision and audio encoders, MiMo-V2.5 targets a specific category of task that most models fail at: long-horizon agentic work. Running multi-step coding agents, navigating real software repositories, reasoning across a million-token context window — these are the problems MiMo-V2.5 was engineered to solve. Here's what it actually is, how it stacks up, and what you realistically need to run it.
MiMo is Xiaomi's large model research line — and V2.5 is the third public release in the family, building on the V2 generation from late 2025. Unlike many so-called open releases that publish weights under restrictive licenses or withhold model cards, MiMo-V2.5 ships with weights, tokenizer, and full documentation available on Hugging Face at XiaomiMiMo/MiMo-V2.5. Nothing is held back.
Mixture-of-Experts is the architectural trick that lets MiMo-V2.5 store 310 billion parameters worth of knowledge while activating only ~15 billion per token during inference. Think of it like a library with a smart routing system: all 310B parameters worth of experts are on the shelves, but a learned router selects only the most relevant specialists for each token — keeping inference cost close to a dense 15B model while knowledge capacity resembles something far larger.
This is why MoE models consistently punch above their active parameter count on benchmarks. The routing mechanism allows the model to accumulate narrow specializations across hundreds of expert sub-networks, then selectively apply only the relevant ones per token. A dense 15B model cannot do this — it has to generalize across all domains with the same fixed weights.
MoE trade-off in plain English: a dense 15B model holds 15B parameters worth of knowledge — everything is always active. A 310B MoE with 15B active holds roughly 20× more knowledge while keeping per-token compute comparable. The cost you pay is storage: all 310B parameters must sit in RAM or VRAM even though only 15B do work on any given token. That storage requirement is where consumer hardware hits its wall.
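To make the routing mechanism concrete, here is a minimal top-k routing sketch in Python. The hidden size, expert count, and top-k value are toy numbers chosen for readability, not MiMo-V2.5's actual configuration, and production MoE layers add load-balancing losses and fused kernels on top of this.
import numpy as np

# Toy top-k MoE routing sketch. Dimensions are illustrative, not MiMo-V2.5's real config.
HIDDEN = 64        # token hidden size (toy value)
NUM_EXPERTS = 32   # experts "on the shelves"
TOP_K = 2          # experts actually activated per token

rng = np.random.default_rng(0)
router = rng.standard_normal((HIDDEN, NUM_EXPERTS))            # learned routing matrix
experts = rng.standard_normal((NUM_EXPERTS, HIDDEN, HIDDEN))   # one weight matrix per expert (simplified FFN)

def moe_layer(token: np.ndarray) -> np.ndarray:
    """Route a single token through only its top-k experts."""
    scores = token @ router                        # score every expert for this token
    chosen = np.argsort(scores)[-TOP_K:]           # keep the k highest-scoring experts
    gates = np.exp(scores[chosen])
    gates /= gates.sum()                           # softmax over the chosen experts only
    # Only TOP_K of NUM_EXPERTS weight matrices do any work for this token;
    # the rest sit idle in memory, which is the MoE storage-vs-compute trade-off.
    return sum(g * (token @ experts[e]) for g, e in zip(gates, chosen))

out = moe_layer(rng.standard_normal(HIDDEN))
print(out.shape)   # (64,): same output shape as a dense layer, at a fraction of the compute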
Xiaomi's focus for MiMo-V2.5 is agentic workflows — specifically the kind of long-horizon, multi-step task completion that separates genuinely useful AI agents from impressive demos. The benchmark category they emphasize is CLAW (Coding with Long-context Agentic Workloads): tasks that require navigating real codebases, writing and running code, reading test output, iterating based on failures, and completing a multi-step goal without human checkpoints. VentureBeat noted that MiMo-V2.5 and its Pro sibling are among the most efficient models tested on these agentic task categories.
Running MiMo-V2.5 locally on consumer hardware is not straightforward. While 15B active parameters sounds manageable, the full 310B weight file must still be resident in memory — and that requirement is large at every quantization level. This is the central tension of large MoE models: fast inference, massive storage footprint.
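A quick back-of-the-envelope calculation shows why. The figures below cover weights only; KV cache, activations, and runtime overhead come on top, and the 2-bit figure is a rough average for mixed IQ2-style GGUF quants rather than an official number.
# Weight storage for 310B parameters at common quantization levels (weights only).
TOTAL_PARAMS = 310e9

bytes_per_param = {
    "fp16/bf16": 2.0,
    "int8":      1.0,
    "int4":      0.5,
    "IQ2-ish":   0.3,   # rough average for mixed ~2-bit GGUF quants (assumption)
}

for fmt, b in bytes_per_param.items():
    gib = TOTAL_PARAMS * b / (1024 ** 3)
    print(f"{fmt:>9}: ~{gib:,.0f} GiB")
# fp16 ~577 GiB, int8 ~289 GiB, int4 ~144 GiB, ~2-bit ~87 GiB:
# every one of these exceeds any single consumer GPU by a wide margin.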
The recommended inference framework for MiMo-V2.5 is SGLang — a Python engine designed for structured generation and high-throughput MoE model serving. It supports tensor parallelism across multiple GPUs, distributed attention, and the quantization options you need to get MiMo-V2.5 running at acceptable speeds on server hardware.
# Install SGLang for MiMo inference
pip install "sglang[all]"
# Launch MiMo-V2.5 across 4x H100 GPUs at int8 precision
python -m sglang.launch_server \
--model-path XiaomiMiMo/MiMo-V2.5 \
--tp-size 4 \
--quantization int8 \
--context-length 32768 \
--port 8080
# Test a generation request against the running server
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"MiMo-V2.5","messages":[{"role":"user","content":"Debug this function and write tests."}]}'
# For consumer hardware — test the smaller MiMo-V2-Flash (12-16GB VRAM):
# github.com/XiaomiMiMo/MiMo-V2-Flash
Community GGUF quants for new models typically appear on Hugging Face within 1–2 weeks of launch. Search for XiaomiMiMo GGUFs and watch the llama.cpp GitHub discussions. Once available, IQ2 quants can be loaded with llama-server for testing — even on CPU-offload setups with 256GB+ system RAM, at 2–4 tok/s but fully functional for validation.
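Both the SGLang server launched above and a llama-server GGUF deployment expose an OpenAI-compatible /v1/chat/completions endpoint, so one client covers either path. Here is a minimal sketch using the openai Python package, assuming a local server on port 8080; for a local server the API key is just a placeholder and the model name is generally whatever the server was launched with.
# Minimal Python client for a local OpenAI-compatible server (SGLang or llama-server).
# Assumes the server is listening on localhost:8080 as in the launch command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="local")  # key is unused locally

resp = client.chat.completions.create(
    model="MiMo-V2.5",   # local servers serve the model they were launched with
    messages=[{"role": "user", "content": "Debug this function and write tests."}],
    max_tokens=512,
)
print(resp.choices[0].message.content)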
The honest breakdown of who can run MiMo-V2.5 locally today is similar to the DeepSeek V4 situation from last week — a short list, but not empty.
If you cannot run the full V2.5 today, MiMo-V2-Flash is the model to try. It's the compact backbone that MiMo-V2.5 was built on — published by Xiaomi on GitHub as a standalone model designed explicitly for single-GPU deployment on 12–16GB VRAM. Community testers report strong reasoning capability at a size that fits in consumer setups. Find it at github.com/XiaomiMiMo/MiMo-V2-Flash.
The pattern is now well-established: a frontier-quality model drops as open weights. Immediately it's out of reach for consumer hardware. Six to eight weeks later, distilled variants at 7B, 14B, and 32B appear — retaining a meaningful fraction of the parent model's reasoning style at sizes that fit on a gaming GPU. And because the parent model is public, distill quality tends to be significantly better than distilling from a closed API model.
MiMo-V2.5 is also notable for something the current consumer-friendly local models don't offer: a truly unified multimodal architecture handling text, images, video, and audio in one model. Most local AI setups in 2026 still require separate models for vision and text. A 14B distill that retains multimodal capability would create a new category of local model — a single Ollama pull that handles every input type with no separate vision pipeline required.
The 1-million-token context window is the other landmark worth tracking. Even if current consumer hardware tops out at 32K context before VRAM runs out, the architectural capability for million-token reasoning is already baked in. As KV cache compression techniques like TurboQuant mature in llama.cpp, a distilled MiMo-V2.5 running at 128K context on 24GB VRAM becomes plausible by Q3 2026.
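The VRAM ceiling on long context comes mostly from the KV cache, which grows linearly with context length. The sketch below uses placeholder attention dimensions (Xiaomi's exact layer and head counts are not covered here), so treat the absolute numbers as illustrative; the point is the scaling, and how much a 4-bit KV cache buys back.
# Rough KV-cache sizing. Layer and head counts are placeholders, not MiMo-V2.5's real config.
def kv_cache_gib(layers, kv_heads, head_dim, context, bytes_per_elem):
    # 2x for keys and values, one cache entry per layer per token
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / (1024 ** 3)

LAYERS, KV_HEADS, HEAD_DIM = 60, 8, 128   # hypothetical attention config

for ctx in (32_768, 131_072, 1_000_000):
    fp16 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx, 2.0)
    q4 = kv_cache_gib(LAYERS, KV_HEADS, HEAD_DIM, ctx, 0.5)   # 4-bit compressed KV cache
    print(f"{ctx:>9,} tokens: ~{fp16:5.1f} GiB fp16 KV cache | ~{q4:5.1f} GiB at 4-bit")
# With these assumptions: 32K fits alongside weights on a big GPU, 128K needs KV compression,
# and a full million-token cache is server-class either way.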
Track the MiMo-V2.5 release at huggingface.co/XiaomiMiMo and watch for new model IDs in the coming weeks. When community distills appear, use Runyard's VRAM Calculator to check hardware fit before committing to a multi-gigabyte download.
Find which coding and agentic models fit your GPU right now — and be ready when the MiMo distills land.