On April 16, 2026, Alibaba's Qwen team released Qwen3.6 — two models that should not be as capable as they are. The 27B dense variant scores 77.2% on SWE-bench Verified, overtaking a 397-billion-parameter MoE and matching Claude 4.5 Opus on Terminal-Bench. The sibling 35B-A3B uses a Mixture-of-Experts design that activates only 3 billion parameters per token — running at roughly 70 tok/s on an RTX 4090 while delivering quality well above its apparent weight class. Both are Apache 2.0 licensed. Both are available on Ollama right now. Here is everything you need to know to get them running on your own hardware.
Qwen3.6 ships as a family of two: a dense model and a sparse Mixture-of-Experts. They target different hardware profiles and different use cases, though both are competitive on benchmark quality.
MoE models route each token through a small subset of specialist sub-networks. Qwen3.6-35B-A3B has 35B parameters' worth of world knowledge stored in its weights, but only "calls" 3B of them per token. This is why MoE models can exceed the quality you would expect from their active parameter count — the full breadth of knowledge is learned at training time, while each token pays only the compute and memory cost of the 3B active parameters at inference.
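To make the routing idea concrete, here is a minimal, illustrative top-k routing sketch in Python. The layer sizes, the two-expert top-k, and the simple softmax gating are assumptions for illustration only; they are far smaller and simpler than whatever Qwen3.6 actually uses:

import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    # Score every expert, but run only the top_k best-scoring ones for this token.
    logits = x @ router_w
    top = np.argsort(logits)[-top_k:]
    gates = np.exp(logits[top])
    gates /= gates.sum()                       # softmax over the chosen experts only

    out = np.zeros_like(x)
    for gate, idx in zip(gates, top):
        w_in, w_out = experts[idx]             # only these weights are read per token
        out += gate * (np.maximum(x @ w_in, 0.0) @ w_out)
    return out

# Toy sizes: hypothetical, nothing like the real model
rng = np.random.default_rng(0)
d_model, d_ff, num_experts = 64, 256, 8
experts = [(rng.normal(size=(d_model, d_ff)) * 0.02,
            rng.normal(size=(d_ff, d_model)) * 0.02) for _ in range(num_experts)]
router_w = rng.normal(size=(d_model, num_experts)) * 0.02
print(moe_layer(rng.normal(size=d_model), experts, router_w).shape)  # (64,)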
SWE-bench Verified is the benchmark that best approximates real software engineering work. Given a real GitHub issue and the repository codebase, can the model write a patch that actually fixes the problem and passes the tests? A 77.2% score means Qwen3.6-27B correctly patches more than three in four real bugs — a number that puts it in the same tier as frontier closed models that cost orders of magnitude more to use.
Terminal-Bench measures a different but equally important skill: can the model navigate a real Linux terminal environment, use shell utilities, manage files, and complete system-level tasks correctly? Qwen3.6-27B matches Claude 4.5 Opus here — which is a striking result for an open-weight model that runs on a single consumer GPU.
The comparison against the 397B MoE deserves a moment. A model with 397 billion total parameters — one that requires multi-GPU server hardware to run — is outperformed on coding tasks by a 27B model that fits on a single RTX 4090 with VRAM to spare. This has been the consistent pattern in Qwen's recent work: brutal parameter efficiency.
Both Qwen3.6 models are realistic on enthusiast-grade consumer hardware — but they land in different tiers.
If you have a 16 GB GPU or a 24 GB Mac, choose Qwen3.6-35B-A3B at an IQ3 quant over the 27B dense model. You get MoE-class knowledge breadth with a smaller footprint — and faster inference, because only 3B parameters fire per token. The 27B dense is the quality-maximiser when you have the VRAM to spare; a rough fit check is sketched below.
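A quick way to sanity-check the fit before downloading (not a substitute for the calculator linked at the end): the model file has to sit in VRAM alongside KV cache and runtime overhead. The overhead figure below is an assumption; the file sizes are the approximate download sizes quoted in the pull commands below:

def fits_in_vram(model_file_gb: float, vram_gb: float, overhead_gb: float = 2.5) -> bool:
    # Weights must fit alongside KV cache and runtime overhead (overhead is a rough assumption).
    return model_file_gb + overhead_gb <= vram_gb

print(fits_in_vram(13, 16))   # 35B-A3B at IQ3 (~13 GB) on a 16 GB GPU -> True, but tight
print(fits_in_vram(18, 24))   # 27B dense (~18 GB) on a 24 GB card     -> True
print(fits_in_vram(18, 16))   # 27B dense on 16 GB                     -> False; drop to a Q3 quant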
Ollama added Qwen3.6 support within days of the April 16 release. Pull any of the following tags and you're running.
# ── Qwen3.6-27B (dense) ──────────────────────────────────────────
# ~18 GB download. Needs 18–24 GB VRAM.
ollama pull qwen3.6:27b
# Q3 quant if you have only 12–16 GB VRAM
ollama pull qwen3.6:27b-q3_K_M
# ── Qwen3.6-35B-A3B (MoE) ────────────────────────────────────────
# IQ3 (~13 GB) — best for 16 GB GPUs and 24 GB Macs
ollama pull qwen3.6:35b-a3b-iq3_s
# IQ4 (~18 GB) — best for RTX 4090 / M4 Max
ollama pull qwen3.6:35b-a3b-iq4_xs
# ── Run interactively ─────────────────────────────────────────────
ollama run qwen3.6:27b
# ── List what you have downloaded ─────────────────────────────────
ollama list
# ── Use via the OpenAI-compatible API (port 11434) ────────────────
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen3.6:35b-a3b-iq4_xs",
        "messages": [{"role":"user","content":"Write a Go HTTP handler that streams SSE events."}]
      }'

The 35B-A3B MoE's trick becomes clear in the tok/s numbers. Because only 3 billion parameters fire per token, the GPU's memory bandwidth is doing far less work per decode step than it would on a comparable dense model. The figures below are for Qwen3.6-35B-A3B at Q4_K_M.
70 tok/s on an RTX 4090 for a model with 35B total parameters is remarkable. For comparison, a 27B dense model at Q4_K_M on the same hardware typically runs at 35–45 tok/s. The MoE delivers close to 2x the throughput because the GPU is only doing the matrix math for 3B active parameters, not 27B, on every single decode step. For interactive chat and IDE autocomplete, this difference is felt immediately.
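A rough way to see where the gap comes from: each decode step has to read every active weight from VRAM at least once, so memory bandwidth divided by the bytes of active weights gives an upper bound on tok/s. The constants below are approximations (an RTX 4090's roughly 1 TB/s of bandwidth, about 0.55 bytes per weight at a Q4-class quant), and real throughput lands well below the ceiling once KV-cache reads, routing overhead, and kernel efficiency are paid for; the point is the dense-versus-MoE ratio, not the absolute numbers:

BANDWIDTH_GB_S = 1008          # RTX 4090 memory bandwidth, approximate
BYTES_PER_WEIGHT = 0.55        # ~4.4 bits per weight at a Q4-class quant, approximate

def decode_ceiling_tok_s(active_params_billions: float) -> float:
    # Upper bound: every active weight is read once per generated token.
    active_gb = active_params_billions * BYTES_PER_WEIGHT
    return BANDWIDTH_GB_S / active_gb

print(f"27B dense ceiling:   {decode_ceiling_tok_s(27):.0f} tok/s")   # ~68, observed 35-45
print(f"35B-A3B MoE ceiling: {decode_ceiling_tok_s(3):.0f} tok/s")    # ~611, observed ~70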
The SWE-bench score reflects genuine strengths that translate to practical day-to-day tasks.
Ollama's OpenAI-compatible API means Qwen3.6 drops into any tool that supports custom model endpoints. Here's a minimal Python client for coding workflows:
from openai import OpenAI

# Point at your local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required field, value is ignored
)

response = client.chat.completions.create(
    model='qwen3.6:27b',  # or 'qwen3.6:35b-a3b-iq4_xs'
    messages=[
        {
            'role': 'system',
            'content': (
                'You are a senior software engineer. '
                'Fix bugs precisely. No explanation unless asked.'
            ),
        },
        {
            'role': 'user',
            'content': open('buggy_module.py').read(),
        },
    ],
    temperature=0.1,  # low temp = more deterministic code output
    max_tokens=2048,
)

print(response.choices[0].message.content)

For IDE integration: Continue.dev works with Ollama out of the box — add a custom model pointing to http://localhost:11434 and select your Qwen3.6 tag. Aider supports Ollama backends directly via the --openai-api-base flag. Both tools benefit from the MoE variant's higher throughput during autocomplete sessions.
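Throughput is felt most directly when tokens are streamed as they arrive, which is how chat and autocomplete front-ends consume the model. A minimal streaming variant of the client above, using the OpenAI SDK's standard stream=True option; the prompt is just an example:

from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

# Print tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model='qwen3.6:35b-a3b-iq4_xs',
    messages=[{'role': 'user', 'content': 'Write a pytest fixture that spins up a temp SQLite DB.'}],
    temperature=0.1,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end='', flush=True)
print()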
A year ago, the gap between the best local coding models and frontier API models was wide enough that it shaped real decisions: use the API for anything that matters, use local for experiments. Qwen3.6 is part of a pattern that has been narrowing that gap faster than most people expected.
The honest framing: Qwen3.6-27B is not an across-the-board equal to Claude 4.5 Opus. On the hardest agentic tasks — multi-step tool use, long-horizon planning, ambiguous requirements — frontier closed models still have a quality edge. But on the 80–90% of real-world coding prompts that developers send every day — debug this function, write a test for this class, explain this regex, refactor this module — the difference in output quality is small enough that most developers would not reliably distinguish them in a blind test.
The calculus for local AI is shifting: if you are currently paying per-token for a coding API, it is worth running a genuine comparison before your next billing cycle. Pull Qwen3.6 locally, send it your last 20 real coding tasks, and evaluate the outputs honestly. For many developers, the 10–20% of tasks that truly need frontier quality become the only reason left to reach for the API. The rest is now squarely within local reach.
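If you want that comparison to be more than vibes, a tiny harness is enough: replay the same prompts against the local model and keep the outputs for side-by-side review. The tasks/ and outputs/ directory names here are just an illustration:

from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')
out_dir = Path('outputs')
out_dir.mkdir(exist_ok=True)

# One prompt per file in ./tasks, e.g. the last 20 real requests you sent to your paid API.
for task in sorted(Path('tasks').glob('*.txt')):
    reply = client.chat.completions.create(
        model='qwen3.6:27b',
        messages=[{'role': 'user', 'content': task.read_text()}],
        temperature=0.1,
    )
    (out_dir / f'{task.stem}.qwen.md').write_text(reply.choices[0].message.content)
    print(f'done: {task.name}')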
Not sure if your GPU can handle Qwen3.6? The Runyard VRAM Calculator at runyard.dev/tools/vram-calculator shows the exact quant level that fits your card, expected tok/s, and whether the 27B dense or the 35B-A3B MoE is the smarter pick for your VRAM — before you commit to a 15 GB download.
Find out which Qwen3.6 variant runs on your GPU and how it ranks against every other local coding model for your hardware.
Open the VRAM Calculator →