One trillion parameters, fully open-sourced under MIT. That is the Ling-2.6-1T announcement that landed from InclusionAI — Ant Group's AGI research division — today, May 4, 2026. The model was previewed as a hosted-only release last week; the weights are now live on Hugging Face and the local AI community is already running hardware checks. With 63 billion active parameters per token, a 262K-token context window, and LiveCodeBench scores that clear GPT-5 by 13 points, this is the kind of open-weight frontier drop that recalibrates what's possible for anyone building production AI pipelines. Here is what you actually need to know.
Ling-2.6-1T is the flagship model in InclusionAI's Ling 2.6 family, built by Ant Group — one of the world's largest fintech companies, with years of infrastructure experience in high-throughput, reliability-critical AI at production scale. The design goal is explicit and different from most frontier releases: this is not a reasoning model. It does not generate hundreds of tokens of internal scratchpad before answering. It is an efficiency-first production model, built for reliable multi-step execution without the token overhead that makes reasoning models expensive to run at scale.
Mixture-of-Experts means Ling-2.6-1T stores one trillion parameters' worth of specialized knowledge across hundreds of expert sub-networks, but only activates 63 billion of them for any given token. A learned router selects which experts to fire based on the input — the rest sit loaded in memory but contribute zero compute to that token. The result is a model with knowledge depth approaching a trillion-parameter system but inference throughput closer to a 63B dense equivalent.
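A minimal sketch of what a top-k MoE layer does per token; the expert count, hidden size, and k below are toy values for illustration, not Ling-2.6-1T's published configuration:

# Toy top-k MoE routing sketch; shapes and expert count are illustrative only.
import numpy as np

def moe_layer(x, router_w, experts, k=2):
    logits = x @ router_w                        # one router score per expert
    top_k = np.argsort(logits)[-k:]              # only the k best experts fire for this token
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                     # softmax over the selected experts only
    # Only the selected experts' weights are read and multiplied; the rest stay idle in VRAM.
    return sum(w * np.tanh(x @ experts[i]) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
hidden, n_experts = 64, 8                        # toy sizes
x = rng.standard_normal(hidden)
router_w = rng.standard_normal((hidden, n_experts))
experts = [rng.standard_normal((hidden, hidden)) for _ in range(n_experts)]
print(moe_layer(x, router_w, experts).shape)     # (64,)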
The bottleneck in LLM inference is memory bandwidth: how fast weights can be streamed from VRAM through the compute units. For an MoE model, it is the active parameters, not the stored ones, that determine per-token bandwidth pressure, because only the selected experts' weights have to be read for each generated token. A 63B-active MoE therefore sustains far higher tokens per second than a naive total-parameter comparison would predict, which is why large MoE models tend to deliver better quality-per-second than dense models at the same hardware tier.
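Back-of-the-envelope math makes the point concrete. The bandwidth figure and FP8 assumption below are illustrative, not measured numbers for this model:

# Rough decode-throughput ceiling: each generated token streams the active weights once.
ACTIVE_PARAMS = 63e9            # active parameters per token
TOTAL_PARAMS = 1e12             # total stored parameters
BYTES_PER_PARAM = 1.0           # FP8
HBM_BANDWIDTH = 3e12            # assumed aggregate bandwidth, bytes/s

print(f"MoE (63B active): ~{HBM_BANDWIDTH / (ACTIVE_PARAMS * BYTES_PER_PARAM):.0f} tokens/s ceiling")
print(f"Dense 1T:         ~{HBM_BANDWIDTH / (TOTAL_PARAMS * BYTES_PER_PARAM):.1f} tokens/s ceiling")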
Standard transformer attention has a fundamental KV cache problem: memory scales linearly with sequence length. At 262K tokens, a standard attention implementation would make the KV cache alone unmanageable even on a large multi-GPU setup. Ling-2.6-1T addresses this with a hybrid of two techniques.
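Before getting to those two techniques, a rough estimate shows the scale of the problem. A conventional multi-head attention stack caches full keys and values per layer per token; the layer count, hidden size, and precision below are assumptions for illustration, not Ling-2.6-1T's published dimensions:

# Standard-attention KV cache grows linearly with context length.
layers, hidden, bytes_per_value = 80, 8192, 2     # assumed dimensions, FP16 cache entries
context = 262_144

kv_per_token = 2 * layers * hidden * bytes_per_value   # 2 = keys + values
print(f"{kv_per_token / 1e6:.1f} MB per token, "
      f"{kv_per_token * context / 1e9:.0f} GB for one 262K-token sequence")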
Multi-head Latent Attention (MLA), first introduced in DeepSeek V2, compresses the KV cache by projecting keys and values through a low-rank bottleneck before storing them. Instead of caching full-dimensional key/value vectors for every attention head, MLA stores a compact latent representation per token and reconstructs the full attention state during inference. This dramatically cuts the per-token memory cost of long contexts. Linear attention handles local token interactions with a fixed-size state that does not grow with sequence length, keeping the fast local pass cheap, while expensive full global attention is reserved for the layers where cross-document reasoning is actually needed.
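Here is a minimal sketch of the MLA idea under assumed dimensions (the latent size and head layout below are illustrative, not the model's published values): cache one small latent vector per token instead of full per-head keys and values, and re-project it when attention runs.

import numpy as np
# MLA sketch: cache a low-rank latent per token, reconstruct K/V on demand.
hidden, n_heads, head_dim, latent_dim = 8192, 64, 128, 512   # assumed dimensions
rng = np.random.default_rng(0)
W_down = rng.standard_normal((hidden, latent_dim)) * 0.01              # compress into the cacheable latent
W_up_k = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.01  # rebuild keys at attention time
W_up_v = rng.standard_normal((latent_dim, n_heads * head_dim)) * 0.01  # rebuild values

x = rng.standard_normal(hidden)            # one token's activation
latent = x @ W_down                        # the only thing stored in the KV cache
k = (latent @ W_up_k).reshape(n_heads, head_dim)
v = (latent @ W_up_v).reshape(n_heads, head_dim)

full_cache = 2 * n_heads * head_dim        # values cached per token per layer without MLA
print(f"cached values per token per layer: {latent_dim} vs {full_cache} "
      f"({full_cache / latent_dim:.0f}x smaller)")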
The dominant pattern in 2025–2026 frontier models has been extended chain-of-thought reasoning: the model generates hundreds or thousands of internal tokens of scratchpad before emitting its answer. This works remarkably well for math competition problems and hard science benchmarks, but it carries a real cost. A reasoning trace of 2,000 internal tokens in front of a 200-token answer means roughly ten times the output tokens, and therefore roughly ten times the compute, latency, and API spend, of a model that answers directly.
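The arithmetic behind that estimate, with assumed pricing and call volume to show how it compounds at production scale:

# Output tokens billed per request: reasoning trace + answer vs. answer only.
reasoning_tokens, answer_tokens = 2_000, 200
direct = answer_tokens
with_trace = reasoning_tokens + answer_tokens
print(f"{with_trace / direct:.0f}x the output tokens per call")        # ~11x

# Assumed figures: $10 per million output tokens, 10,000 calls per day.
price = 10 / 1e6
calls = 10_000
print(f"direct: ${direct * price * calls:,.0f}/day vs with trace: ${with_trace * price * calls:,.0f}/day")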
Ling-2.6-1T skips the scratchpad by design. InclusionAI calls it "token efficiency" — strong intelligence without long reasoning traces. In production pipelines making thousands of calls per day — code generation, document processing, multi-step agent execution — this compounds into significant infrastructure savings. The model delivers quality output on most real-world tasks without writing a novel-length internal monologue first.
When to choose Ling-2.6-1T versus a reasoning model: for math olympiad problems, graduate-level science, or hard multi-hop logic puzzles, a dedicated reasoner like DeepSeek-R1 or Qwen3-Coder-Next in thinking mode will likely win. For production coding agents, instruction following at scale, document processing, and tool-use workflows where latency and cost matter — Ling-2.6-1T's non-reasoning design is a deliberate advantage, not a limitation.
The LiveCodeBench result is the headline: 61.7% for Ling-2.6-1T against 48–49% for Kimi-K2, GPT-5, and DeepSeek-V3.1. That is a 12+ point gap on a benchmark that does not reward pattern-matching shortcuts — LiveCodeBench requires writing correct, executable code that passes real test suites. This establishes Ling-2.6-1T as the strongest non-reasoning open-weight model on code generation by this measure at the time of release.
One trillion parameters is not a consumer GPU story — let's be direct about this upfront. Even at aggressive quantization, the raw weight storage for Ling-2.6-1T requires a serious multi-GPU setup. The 63B active parameters mean inference is fast relative to a dense model of the same active size. But all 1T weight values must reside in GPU VRAM simultaneously — the MoE router needs access to every expert sub-network at every inference step, with no lazy loading.
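A quick sizing check of the weight storage alone, at common quantization levels; this is straight arithmetic, not a measured serving footprint, which would also need room for KV cache and activations:

# Weight storage for 1T parameters at common quantization levels.
TOTAL_PARAMS = 1e12
for name, bytes_per_param in [("FP16", 2.0), ("FP8", 1.0), ("4-bit", 0.5)]:
    print(f"{name}: ~{TOTAL_PARAMS * bytes_per_param / 1e9:,.0f} GB of weights, "
          f"all resident across the tensor-parallel group")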
SGLang is the deployment framework the InclusionAI team used and tested in their technical report. It natively supports the MoE architecture and tensor parallelism across multiple GPUs. The serving setup is a single command once hardware is in place. Start with a reduced --context-length (32K or 65K) rather than the full 262K to keep KV cache allocation manageable when you are near your VRAM ceiling.
# Install SGLang with all extras (CUDA 12.1+ required)
pip install "sglang[all]"
# Serve Ling-2.6-1T across 8 GPUs at FP8 quantization
python -m sglang.launch_server \
--model-path inclusionAI/Ling-2.6-1T \
--tp-size 8 \
--context-length 32768 \
--quantization fp8 \
--mem-fraction-static 0.85 \
--port 8080
# Test the OpenAI-compatible endpoint
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "inclusionAI/Ling-2.6-1T",
"messages": [{"role": "user", "content": "Write an async Python web scraper using httpx"}],
"max_tokens": 1024
}'
If the 1T flagship is outside your hardware budget — and honestly, it is for almost everyone's personal setup — InclusionAI also released Ling-2.6-Flash alongside it. Flash is the compact, consumer-deployable sibling: far fewer total parameters, targeting the 16–48 GB VRAM tier that covers RTX 4090s, M3 Max Macs, and high-end workstations. Early Hugging Face discussions show community members already running Ling-2.6-Flash on RTX 4090 and M4 Max setups.
The architecture is the same: MoE, hybrid MLA + Linear Attention, token efficiency by default, tool-use optimization baked in. Flash inherits everything that makes Ling-2.6-1T's design philosophy compelling — at a weight size you can actually download overnight and run on a single GPU. Community GGUF releases for Ollama and llama.cpp typically follow within a few days of the original Hugging Face upload. Watch huggingface.co/inclusionAI/Ling-2.6-flash.
When GGUF quantized versions of Ling-2.6-Flash for Ollama and llama.cpp land on Hugging Face, search for "Ling-2.6-flash GGUF" and sort by Most Downloads to find community-tested quants. Before committing to a 20–50 GB download, use the Runyard VRAM Calculator to confirm which quantization level fits your exact GPU, including KV cache headroom at your target context length.
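For a rough back-of-the-envelope version of that check, the arithmetic is simple; the parameter count and per-token KV cost below are placeholder values, not Ling-2.6-Flash's real figures:

# Rough "does this quant fit?" check: quantized weights + KV cache + overhead vs available VRAM.
def fits(total_params, bits_per_weight, kv_mb_per_token, context_len, vram_gb, overhead_gb=2.0):
    weights_gb = total_params * bits_per_weight / 8 / 1e9
    kv_gb = kv_mb_per_token * context_len / 1024
    needed = weights_gb + kv_gb + overhead_gb
    return needed, needed <= vram_gb

# Placeholder example: a hypothetical 30B-parameter model at 4-bit, 32K context, on a 24 GB GPU.
needed, ok = fits(total_params=30e9, bits_per_weight=4, kv_mb_per_token=0.05,
                  context_len=32_768, vram_gb=24)
print(f"~{needed:.0f} GB needed -> {'fits' if ok else 'does not fit'} in 24 GB")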
Find out which quantization of Ling-2.6-Flash fits your GPU — and see every other May 2026 model release matched to your exact hardware.
Open the VRAM Calculator →