The Qwen3.6 family landed in April 2026 and the headline model was the dense 27B — 77.2% SWE-bench, great numbers. But the variant that deserves more attention for local AI runners is the 35B-A3B: a Mixture-of-Experts architecture that loads 35 billion parameters but only routes through 3.5 billion per token. The result is 73.4% on SWE-bench Verified at a compute cost closer to a 3B model than a 35B. It generates tokens on an RTX 4090 at roughly 95 tok/s — more than twice the speed of the dense 27B on the same GPU. And almost nobody in the local AI community is talking about it yet.
Most local model discussions focus on total parameter count — 7B, 13B, 70B. For dense models, total params is the right metric because every parameter is used on every token. MoE (Mixture-of-Experts) models break that assumption entirely. Qwen3.6-35B-A3B has 35 billion total parameters, but at inference time only 3.5 billion are active per forward pass. The rest sit loaded in VRAM, available but idle for that particular token.
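To make the active-versus-total distinction concrete, here is a deliberately tiny top-k routing sketch in plain NumPy. The expert count, top-k value, and hidden size are invented for illustration; this is not Qwen's actual routing code, only the general shape of the mechanism.

import numpy as np

n_experts = 64   # every expert's weights stay resident in memory
top_k = 4        # but only a handful are computed for any given token
d_model = 512    # toy hidden size

rng = np.random.default_rng(0)
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(token_vec):
    scores = token_vec @ router              # the router scores every expert
    chosen = np.argsort(scores)[-top_k:]     # pick the top-k experts for this token
    gates = np.exp(scores[chosen])
    gates /= gates.sum()                     # softmax over the chosen experts only
    out = np.zeros_like(token_vec)
    for gate, idx in zip(gates, chosen):
        out += gate * (token_vec @ expert_weights[idx])  # compute touches only k experts
    return out

token = rng.standard_normal(d_model)
_ = moe_layer(token)
# Storage cost: all 64 expert matrices. Compute cost per token: 4 of them.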
LLM inference speed is bottlenecked by how many parameters you compute through per token — not by how many are loaded. A 35B-A3B model generates tokens at roughly the speed of a 3.5B dense model, while delivering output quality that required 35B parameters to train. You pay the VRAM cost of a 35B model and get the latency of a 3.5B model. That trade-off is specifically what makes MoE architectures so interesting for local runners who care about responsiveness.
SWE-bench Verified is the hardest coding benchmark widely used in 2026. It tests models on real GitHub issues requiring understanding of a full codebase, identifying the root cause of a bug, and writing a patch that passes all existing tests. A 73.4% score puts Qwen3.6-35B-A3B well into the range of models that can handle genuine software engineering tasks — not just toy LeetCode problems.
The 35B-A3B scores 3.8 points below the dense 27B. In exchange you get roughly 2.5× faster inference on the same GPU. That is not a marginal gain — at 95 tok/s versus 38 tok/s on an RTX 4090, the MoE variant feels interactive where the dense model feels sluggish. Whether that trade-off is worth it depends entirely on your use case.
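As a quick sanity check on what that gap feels like, here is the arithmetic for a single response at the RTX 4090 speeds quoted above. The 800-token response length is an arbitrary illustration:

# Time to generate one response at the throughputs quoted above.
response_tokens = 800
for name, tok_per_s in [("Qwen3.6-35B-A3B (MoE)", 95), ("Qwen3.6-27B (dense)", 38)]:
    seconds = response_tokens / tok_per_s
    print(f"{name}: {seconds:.1f} s")   # roughly 8 s versus 21 s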
Here's the critical thing MoE model users often misunderstand: you still need to load ALL 35 billion parameters into VRAM. The routing mechanism must have all expert weights accessible at inference time so it can dispatch tokens to the appropriate experts. The 3.5B active figure describes computation per token — not what's stored in memory.
VRAM requirement is determined by total parameter count. Inference speed is determined by active parameter count. You pay the storage cost of a 35B model and get the compute speed of a 3.5B model. That asymmetry is the entire value proposition.
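A rough back-of-envelope version of that asymmetry, using an approximate effective size for Q4_K_M weights (the bits-per-weight figure is an estimate and the exact overhead varies by build):

# Storage follows total parameters, speed follows active ones.
total_params = 35e9
active_params = 3.5e9
bits_per_weight = 4.8   # rough effective size of Q4_K_M

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"Quantized weights: ~{weights_gb:.0f} GB")   # ~21 GB, matching the download size
print(f"Parameters computed per token: {active_params / total_params:.0%}")  # 10%
# KV cache and activations add overhead on top of the weights themselves.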
The 24GB tier is the target for this model. RTX 4090, RTX 3090, and RX 7900 XTX all land here. Apple M-series with 36GB+ unified memory handles Q4_K_M at around 45-55 tok/s — not as fast as a 4090 but convenient for MacBook users. If you only have 16GB, Q3_K_M is possible but the quality gap on hard coding tasks becomes noticeable. Use the VRAM Calculator at the bottom to check your exact GPU.
Qwen3.6-35B-A3B ships as a single unified checkpoint with two operating modes. Non-thinking mode gives fast, direct answers — useful for autocomplete, boilerplate, quick Q&A. Thinking mode triggers extended chain-of-thought reasoning, materially improving performance on hard debugging tasks, algorithm design, and multi-step refactoring. You choose at inference time with a single parameter.
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored by Ollama
)

# Non-thinking mode — fast, direct answers
response = client.chat.completions.create(
    model='qwen3.6:35b-a3b',
    messages=[{'role': 'user', 'content': 'Write a Python function to parse ISO 8601 dates.'}],
    extra_body={'think': False},
)

# Thinking mode — deeper reasoning for hard problems
response = client.chat.completions.create(
    model='qwen3.6:35b-a3b',
    messages=[{'role': 'user', 'content': 'Debug this race condition in my async Rust code.'}],
    extra_body={'think': True},
)

print(response.choices[0].message.content)

Ollama shipped support for the Qwen3.6 family at launch. The 35B-A3B variant is available directly from the library. The default pull uses Q4_K_M quantization — the download is approximately 21 GB. Once loaded, it stays resident in VRAM between requests so the second response comes back in seconds, not minutes.
# Pull the model (default Q4_K_M, ~21 GB download)
ollama pull qwen3.6:35b-a3b
# Start an interactive session
ollama run qwen3.6:35b-a3b
# Check VRAM usage after the model loads
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
# Expected: roughly 21500 MiB used on a 24GB card
# Use the OpenAI-compatible API (drop-in for any tool that supports OpenAI)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.6:35b-a3b",
"messages": [{"role": "user", "content": "Review this Go function for correctness."}]
}'For agentic loops, batch code review, or any scenario with concurrent requests, vLLM outperforms Ollama significantly. vLLM 0.19.0+ has native Qwen3.6 MoE support with tensor parallel for multi-GPU setups.
# Install vLLM (requires CUDA 11.8+ or ROCm 5.6+)
pip install "vllm>=0.19.0"
# Single 24GB GPU with AWQ quantization (best quality at 24GB)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-35B-A3B-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --port 8000
# Multi-GPU FP16 (2× RTX 3090 for full precision)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --port 8000

AWQ quantization (via vLLM) is preferable to GGUF Q4_K_M for coding tasks because it uses activation-aware calibration — meaning weights are quantized based on which values actually matter for the model's outputs, not just uniform bit reduction. The quality difference is most visible on complex multi-file refactoring tasks. If you are running Ollama for convenience, Q4_K_M is perfectly acceptable for most use cases.
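Since the main reason to reach for vLLM here is concurrent throughput, a minimal client-side sketch of that pattern might look like the following. It assumes the AWQ server started above is listening on port 8000; the file contents, prompt, and max_tokens value are placeholders.

import asyncio
from openai import AsyncOpenAI

# Points at the local vLLM OpenAI-compatible endpoint; no real API key needed.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def review(path: str, source: str) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.6-35B-A3B-AWQ",
        messages=[{"role": "user",
                   "content": f"Review {path} for correctness:\n\n{source}"}],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

async def main():
    # Placeholder inputs; in a real batch-review loop these come from the repo.
    files = {
        "parser.py": "def add(a, b):\n    return a - b\n",
        "utils.py": "def is_even(n):\n    return n % 2 == 1\n",
    }
    # vLLM's continuous batching serves these requests in parallel.
    reviews = await asyncio.gather(*(review(p, s) for p, s in files.items()))
    for path, text in zip(files, reviews):
        print(f"--- {path} ---\n{text}\n")

asyncio.run(main())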
Should you run the 35B-A3B or the dense 27B? This is the question anyone familiar with the Qwen3.6 family will ask. Both models are genuinely impressive. The answer comes down to whether your bottleneck is quality or latency.
When TurboQuant lands in llama.cpp (targeted Q3 2026), the 262K context window becomes even more accessible. TurboQuant compresses the KV cache 4×, meaning the VRAM headroom currently eaten by a 32K context will shrink to the equivalent of an 8K context. On a 24GB GPU, that turns previously theoretical context lengths into practical everyday use.
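For a sense of scale, here is the standard KV-cache arithmetic with the 4× compression applied. The layer count, KV-head count, and head dimension below are assumptions for illustration, not published Qwen3.6-35B-A3B figures:

# KV-cache size for a given context length. Layer, head, and dimension
# values are assumed for illustration, not published model specs.
layers, kv_heads, head_dim = 48, 8, 128

def kv_cache_gb(context_tokens: int, bytes_per_value: float) -> float:
    # 2x for keys and values, stored for every layer and KV head
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

fp16 = 2.0
print(f"32K context, FP16 cache:     {kv_cache_gb(32_768, fp16):.1f} GB")
print(f"32K context, 4x compressed:  {kv_cache_gb(32_768, fp16 / 4):.1f} GB")
print(f"8K context, FP16 cache:      {kv_cache_gb(8_192, fp16):.1f} GB")
# The compressed 32K cache lands at the same footprint as an uncompressed 8K cache.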
Before committing to a 21 GB download, it is worth confirming that your specific GPU has the headroom to run this model — especially if you run other applications alongside it. Runyard's VRAM Calculator lets you enter your GPU model and instantly see which quantization of Qwen3.6-35B-A3B fits, how much headroom remains for the KV cache, and what tokens per second to expect. Check before you download.