The Qwen3.6 family landed in April 2026 and the headline model was the dense 27B — 77.2% SWE-bench, great numbers. But the variant that deserves more attention for local AI runners is the 35B-A3B: a Mixture-of-Experts architecture that loads 35 billion parameters but only routes through 3.5 billion per token. The result is 73.4% on SWE-bench Verified at a compute cost closer to a 3B model than a 35B. It generates tokens on an RTX 4090 at roughly 95 tok/s — more than twice the speed of the dense 27B on the same GPU. And almost nobody in the local AI community is talking about it yet.
Most local model discussions focus on total parameter count — 7B, 13B, 70B. For dense models, total params is the right metric because every parameter is used on every token. MoE (Mixture-of-Experts) models break that assumption entirely. Qwen3.6-35B-A3B has 35 billion total parameters, but at inference time only 3.5 billion are active per forward pass. The rest sit loaded in VRAM, available but idle for that particular token.
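To make the active-versus-total distinction concrete, here is a deliberately tiny top-k routing sketch in plain NumPy. The expert count, top-k value, and hidden size are invented for illustration; this is not Qwen's actual routing code, only the general shape of the mechanism.

import numpy as np

n_experts = 64   # every expert's weights stay resident in memory
top_k = 4        # but only a handful are computed for any given token
d_model = 512    # toy hidden size

rng = np.random.default_rng(0)
expert_weights = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts))

def moe_layer(token_vec):
    scores = token_vec @ router              # the router scores every expert
    chosen = np.argsort(scores)[-top_k:]     # pick the top-k experts for this token
    gates = np.exp(scores[chosen])
    gates /= gates.sum()                     # softmax over the chosen experts only
    out = np.zeros_like(token_vec)
    for gate, idx in zip(gates, chosen):
        out += gate * (token_vec @ expert_weights[idx])  # compute touches only k experts
    return out

token = rng.standard_normal(d_model)
_ = moe_layer(token)
# Storage cost: all 64 expert matrices. Compute cost per token: 4 of them.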
LLM inference speed is bottlenecked by how many parameters you compute through per token — not by how many are loaded. A 35B-A3B model generates tokens at roughly the speed of a 3.5B dense model, while delivering output quality that required 35B parameters to train. You pay the VRAM cost of a 35B model and get the latency of a 3.5B model. That trade-off is specifically what makes MoE architectures so interesting for local runners who care about responsiveness.
SWE-bench Verified is the hardest coding benchmark widely used in 2026. It tests models on real GitHub issues requiring understanding of a full codebase, identifying the root cause of a bug, and writing a patch that passes all existing tests. A 73.4% score puts Qwen3.6-35B-A3B well into the range of models that can handle genuine software engineering tasks — not just toy LeetCode problems.
The 35B-A3B scores 3.8 points below the dense 27B. In exchange you get roughly 2.5× faster inference on the same GPU. That is not a marginal gain — at 95 tok/s versus 38 tok/s on an RTX 4090, the MoE variant feels interactive where the dense model feels sluggish. Whether that trade-off is worth it depends entirely on your use case.
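As a quick sanity check on what that gap feels like, here is the arithmetic for a single response at the RTX 4090 speeds quoted above. The 800-token response length is an arbitrary illustration:

# Time to generate one response at the throughputs quoted above.
response_tokens = 800
for name, tok_per_s in [("Qwen3.6-35B-A3B (MoE)", 95), ("Qwen3.6-27B (dense)", 38)]:
    seconds = response_tokens / tok_per_s
    print(f"{name}: {seconds:.1f} s")   # roughly 8 s versus 21 s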
Here's the critical thing MoE model users often misunderstand: you still need to load ALL 35 billion parameters into VRAM. The routing mechanism must have all expert weights accessible at inference time so it can dispatch tokens to the appropriate experts. The 3.5B active figure describes computation per token — not what's stored in memory.
VRAM requirement is determined by total parameter count. Inference speed is determined by active parameter count. You pay the storage cost of a 35B model and get the compute speed of a 3.5B model. That asymmetry is the entire value proposition.
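A rough back-of-envelope version of that asymmetry, using an approximate effective size for Q4_K_M weights (the bits-per-weight figure is an estimate and the exact overhead varies by build):

# Storage follows total parameters, speed follows active ones.
total_params = 35e9
active_params = 3.5e9
bits_per_weight = 4.8   # rough effective size of Q4_K_M

weights_gb = total_params * bits_per_weight / 8 / 1e9
print(f"Quantized weights: ~{weights_gb:.0f} GB")   # ~21 GB, matching the download size
print(f"Parameters computed per token: {active_params / total_params:.0%}")  # 10%
# KV cache and activations add overhead on top of the weights themselves.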
The 24GB tier is the target for this model. RTX 4090, RTX 3090, and RX 7900 XTX all land here. Apple M-series with 36GB+ unified memory handles Q4_K_M at around 45-55 tok/s — not as fast as a 4090 but convenient for MacBook users. If you only have 16GB, Q3_K_M is possible but the quality gap on hard coding tasks becomes noticeable. Use the VRAM Calculator at the bottom to check your exact GPU.
Qwen3.6-35B-A3B ships as a single unified checkpoint with two operating modes. Non-thinking mode gives fast, direct answers — useful for autocomplete, boilerplate, quick Q&A. Thinking mode triggers extended chain-of-thought reasoning, materially improving performance on hard debugging tasks, algorithm design, and multi-step refactoring. You choose at inference time with a single parameter.
from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored by Ollama
)

# Non-thinking mode — fast, direct answers
response = client.chat.completions.create(
    model='qwen3.6:35b-a3b',
    messages=[{'role': 'user', 'content': 'Write a Python function to parse ISO 8601 dates.'}],
    extra_body={'think': False},
)

# Thinking mode — deeper reasoning for hard problems
response = client.chat.completions.create(
    model='qwen3.6:35b-a3b',
    messages=[{'role': 'user', 'content': 'Debug this race condition in my async Rust code.'}],
    extra_body={'think': True},
)

print(response.choices[0].message.content)

Ollama shipped support for the Qwen3.6 family at launch. The 35B-A3B variant is available directly from the library. The default pull uses Q4_K_M quantization — the download is approximately 21 GB. Once loaded, it stays resident in VRAM between requests so the second response comes back in seconds, not minutes.
# Pull the model (default Q4_K_M, ~21 GB download)
ollama pull qwen3.6:35b-a3b
# Start an interactive session
ollama run qwen3.6:35b-a3b
# Check VRAM usage after the model loads
nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader
# Expected: roughly 21500 MiB used on a 24GB card
# Use the OpenAI-compatible API (drop-in for any tool that supports OpenAI)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3.6:35b-a3b",
"messages": [{"role": "user", "content": "Review this Go function for correctness."}]
}'For agentic loops, batch code review, or any scenario with concurrent requests, vLLM outperforms Ollama significantly. vLLM 0.19.0+ has native Qwen3.6 MoE support with tensor parallel for multi-GPU setups.
# Install vLLM (requires CUDA 11.8+ or ROCm 5.6+)
pip install "vllm>=0.19.0"
# Single 24GB GPU with AWQ quantization (best quality at 24GB)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-35B-A3B-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --port 8000
# Multi-GPU FP16 (2× RTX 3090 for full precision)
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.6-35B-A3B \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --port 8000

AWQ quantization (via vLLM) is preferable to GGUF Q4_K_M for coding tasks because it uses activation-aware calibration — meaning weights are quantized based on which values actually matter for the model's outputs, not just uniform bit reduction. The quality difference is most visible on complex multi-file refactoring tasks. If you are running Ollama for convenience, Q4_K_M is perfectly acceptable for most use cases.
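Since the main reason to reach for vLLM here is concurrent throughput, a minimal client-side sketch of that pattern might look like the following. It assumes the AWQ server started above is listening on port 8000; the file contents, prompt, and max_tokens value are placeholders.

import asyncio
from openai import AsyncOpenAI

# Points at the local vLLM OpenAI-compatible endpoint; no real API key needed.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def review(path: str, source: str) -> str:
    resp = await client.chat.completions.create(
        model="Qwen/Qwen3.6-35B-A3B-AWQ",
        messages=[{"role": "user",
                   "content": f"Review {path} for correctness:\n\n{source}"}],
        max_tokens=1024,
    )
    return resp.choices[0].message.content

async def main():
    # Placeholder inputs; in a real batch-review loop these come from the repo.
    files = {
        "parser.py": "def add(a, b):\n    return a - b\n",
        "utils.py": "def is_even(n):\n    return n % 2 == 1\n",
    }
    # vLLM's continuous batching serves these requests in parallel.
    reviews = await asyncio.gather(*(review(p, s) for p, s in files.items()))
    for path, text in zip(files, reviews):
        print(f"--- {path} ---\n{text}\n")

asyncio.run(main())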
Should you run the 35B-A3B or the dense 27B? This is the question anyone familiar with the Qwen3.6 family will ask. Both models are genuinely impressive. The answer comes down to whether your bottleneck is quality or latency.
When TurboQuant lands in llama.cpp (targeted Q3 2026), the 262K context window becomes even more accessible. TurboQuant compresses the KV cache 4×, meaning the VRAM headroom currently eaten by a 32K context will shrink to the equivalent of an 8K context. On a 24GB GPU, that turns previously theoretical context lengths into practical everyday use.
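For a sense of scale, here is the standard KV-cache arithmetic with the 4× compression applied. The layer count, KV-head count, and head dimension below are assumptions for illustration, not published Qwen3.6-35B-A3B figures:

# KV-cache size for a given context length. Layer, head, and dimension
# values are assumed for illustration, not published model specs.
layers, kv_heads, head_dim = 48, 8, 128

def kv_cache_gb(context_tokens: int, bytes_per_value: float) -> float:
    # 2x for keys and values, stored for every layer and KV head
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens / 1e9

fp16 = 2.0
print(f"32K context, FP16 cache:     {kv_cache_gb(32_768, fp16):.1f} GB")
print(f"32K context, 4x compressed:  {kv_cache_gb(32_768, fp16 / 4):.1f} GB")
print(f"8K context, FP16 cache:      {kv_cache_gb(8_192, fp16):.1f} GB")
# The compressed 32K cache lands at the same footprint as an uncompressed 8K cache.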
Before committing to a 21 GB download, it is worth confirming that your specific GPU has the headroom to run this model — especially if you run other applications alongside it. Runyard's VRAM Calculator lets you enter your GPU model and instantly see which quantization of Qwen3.6-35B-A3B fits, how much headroom remains for the KV cache, and what tokens per second to expect. Check before you download.