A coding model with 3 billion active parameters matching models with 10–20x more active compute on real software engineering benchmarks. That sentence should not make sense — and yet it precisely describes Qwen3-Coder-Next, the latest open-weight release from Alibaba's Qwen team. Built on the Qwen3-Next-80B-A3B-Base architecture, it stores 80 billion parameters but activates only 3 billion per forward pass. The result is a model that fits on a 64 GB MacBook or a single RTX 5090, handles 256K-token context natively, scores 44.3% on SWE-Bench Pro, and exceeds 70% on SWE-Bench Verified with agent scaffolding. This is the moment local coding agents crossed a threshold that felt unreachable a year ago.
Most "coding models" are general-purpose models with extra coding data in the training mix. They understand code well, but they weren't built for the full agentic loop: reading files, making tool calls, catching runtime errors, writing tests, and iterating — all within a single long coherent session. Qwen3-Coder-Next was designed specifically for this. The technical report describes an elaborate training recipe focused on long-horizon reasoning, complex tool usage, and recovery from execution failures.
The distinction matters practically. An agentic coding session needs to hold a lot in context simultaneously — the project structure, open file contents, tool call history, error messages, test output, and partial solutions. Qwen3-Coder-Next's 256K native context window (extendable to 1M via YaRN interpolation) is not a marketing headline. It is a deliberate architectural choice for the use case. If you have ever hit context limits mid-session while running coding tasks locally, this is the model that addresses it directly.
Qwen3-Coder-Next sits on top of Qwen3-Next-80B-A3B-Base, which combines two architectural ideas: sparse Mixture-of-Experts feed-forward layers and hybrid attention. Understanding both explains why an 80B-parameter model runs as fast as it does on consumer hardware.
In a dense transformer, every parameter participates in every forward pass — the full weight matrix fires for every single token. Mixture-of-Experts replaces the dense feed-forward blocks with a pool of specialized sub-networks ("experts") plus a learned router. For each token, the router selects a small fixed number of experts to activate. The remaining experts sit loaded in memory but contribute zero compute to that token.
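To make the routing concrete, here is a minimal sketch of top-k expert routing in plain NumPy. The expert count, top-k value, and dimensions are illustrative stand-ins, not Qwen3-Coder-Next's actual configuration:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route one token through its top-k experts (illustrative sizes only)."""
    scores = x @ router_w                      # (num_experts,) router logits
    top = np.argsort(scores)[-k:]              # indices of the k best experts
    gates = np.exp(scores[top])
    gates /= gates.sum()                       # softmax over the selected experts
    # Only the chosen experts run; the rest contribute zero compute.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, num_experts = 64, 8
experts = [
    (lambda w: (lambda x: np.maximum(x @ w, 0.0)))(rng.normal(size=(d, d)))
    for _ in range(num_experts)
]
router_w = rng.normal(size=(d, num_experts))
token = rng.normal(size=d)
out = moe_forward(token, router_w, experts, k=2)  # 2 of 8 experts fire
```

The key property the sketch shows: compute per token scales with `k`, while the memory to hold `experts` scales with `num_experts`.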
In Qwen3-Coder-Next, only 3B of the 80B total parameters are active per forward pass. Memory bandwidth — the primary throughput bottleneck on modern GPU and unified-memory hardware — scales with active parameters, not stored size. This is why an 80B MoE with 3B active parameters can match or exceed the inference speed of some dense 7B models: the hardware only moves 3B parameters' worth of weights through compute each step. You get the reasoning depth of 80B worth of trained knowledge at roughly the inference cost of a 3B model.
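A back-of-envelope calculation shows why active parameters dominate decode speed. The bandwidth figure and bytes-per-weight below are assumptions for illustration, not measurements:

```python
# Rough decode-speed ceiling: tokens/sec is approximately
# memory_bandwidth / bytes_read_per_token, and a MoE only streams
# its *active* weights per token. Ignores attention and KV-cache traffic.
GB = 1e9

def tokens_per_sec(active_params_b, bits_per_weight, bandwidth_gb_s):
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * GB / bytes_per_token

# Hypothetical ~400 GB/s unified-memory machine, ~4.5 bits/weight at Q4_K_M:
print(tokens_per_sec(3, 4.5, 400))    # ~237 tok/s ceiling with 3B active
print(tokens_per_sec(80, 4.5, 400))   # ~8.9 tok/s if all 80B were dense
```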
Standard full-context attention scales quadratically with sequence length — devastating at 256K tokens. Qwen3-Coder-Next uses hybrid attention that alternates between local sliding-window attention (fast, O(n) memory, covers nearby tokens) and global full-context attention (precise, applied at selected transformer layers). This keeps memory and compute manageable across the full context window while retaining the long-range reasoning needed to track variable definitions across files, follow import chains, and understand refactoring implications.
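To see why the local layers are cheap, here is a toy mask comparison in NumPy (the window size is an arbitrary illustration, not the model's real setting):

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Causal attention mask, optionally restricted to a sliding window."""
    i = np.arange(seq_len)[:, None]    # query positions
    j = np.arange(seq_len)[None, :]    # key positions
    causal = j <= i
    if window is None:
        return causal                  # global layer: all prior tokens visible
    return causal & (i - j < window)   # local layer: only the last `window`

# Attention cost grows with the number of visible query-key pairs:
n = 4096
print(attention_mask(n).sum())              # ~n^2/2 pairs for a global layer
print(attention_mask(n, window=128).sum())  # ~n*128 pairs for a local layer
```

Alternating the cheap local layers with a few global ones is what keeps 256K-token sessions tractable.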
SWE-Bench tests models against real GitHub issues — the model must read an actual open-source codebase, understand a bug report, write a fix, and pass the project's own test suite. SWE-Bench Pro is the harder variant, with issues specifically selected to resist pattern-matching shortcuts. There is no way to brute-force a good score; the model has to actually reason about the code.
Qwen3-Coder-Next scores 44.3% on SWE-Bench Pro. The technical report is specific: this is performance "comparable to models with 10–20x more active parameters." That means it matches or exceeds dense models in the 30–60B active-parameter range. For local AI users, the practical implication is clear — you are no longer forced to choose between model quality and model size. The trade-off has collapsed for most real coding tasks.
On SWE-Bench Verified, the original and most widely cited benchmark in this space, Qwen3-Coder-Next exceeds 70% when paired with a SWE-Agent scaffold. Twelve months ago, 70% on SWE-Bench Verified was the frontier for the largest cloud-only models. Running such a model locally on consumer hardware was not something anyone was seriously considering. That has genuinely changed.
The memory footprint for Qwen3-Coder-Next at Q4_K_M sits around 45 GB. You need to hold all 80B weights in memory simultaneously, even though only 3B are active at any given moment. Here is what that means for real hardware:
MoE models are storage-constrained, not compute-constrained. You need enough memory to hold all 80B weights, but inference speed is determined by the bandwidth required to stream the 3B active parameters — not the full weight count. A machine that can hold the weights will run them fast. Use Runyard's VRAM Calculator to verify your exact hardware configuration before committing to a 40–80 GB download.
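The footprint arithmetic is simple enough to sanity-check yourself. The bits-per-weight values below are rough averages for each quant family, and the KV cache at long context adds more on top:

```python
# Rough weight-storage estimate for a quantized checkpoint.
# Real GGUF files vary slightly; KV cache is extra and grows with context.
def weights_gb(total_params_b, bits_per_weight):
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

print(weights_gb(80, 4.5))   # Q4_K_M ~= 45 GB (matches the figure above)
print(weights_gb(80, 8.0))   # Q8_0   ~= 80 GB
print(weights_gb(80, 16.0))  # BF16   ~= 160 GB, server territory
```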
There are three practical paths to running Qwen3-Coder-Next locally. Ollama is the fastest start. llama.cpp gives you maximum control. And the OpenAI-compatible API endpoint lets you plug it into any IDE or agent scaffold that expects a cloud API.
```bash
# Path 1: Ollama (easiest — handles quantization and serving automatically)
ollama pull qwen3-coder
ollama run qwen3-coder
```

```bash
# Path 2: llama.cpp server (full control over context length and GPU layers)
# Download GGUF from: huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
llama-server \
  --model Qwen3-Coder-Next-Q4_K_M.gguf \
  --ctx-size 65536 \
  --n-gpu-layers 99 \
  --port 8080
```

```bash
# Path 3: OpenAI-compatible API via Ollama (for IDE and agent integrations)
ollama serve &
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-coder",
    "messages": [
      {"role": "user", "content": "Refactor this Python class to use dataclasses and add type hints"}
    ],
    "stream": true
  }'
```

Qwen3-Coder-Next's 256K context window is its real competitive advantage in agent workflows, so set the context length explicitly. Ollama's default context can be as low as 2048 or 4096 tokens, which squanders the model's long-context capability entirely. Pass `--ctx-size 32768` or higher to llama-server if memory allows (the Path 2 example above uses 65536), or set `num_ctx` in an Ollama Modelfile, to make sure you are actually using the long context the model was built for.
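For Ollama, the Modelfile route looks like this (32768 is a starting point; raise it if your memory headroom allows):

```
# Modelfile: extend the default context window for agent sessions
FROM qwen3-coder
PARAMETER num_ctx 32768
```

Build it with `ollama create qwen3-coder-32k -f Modelfile`, then point your IDE or scaffold at the new model name.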
The 70%+ SWE-Bench Verified result is achieved with SWE-Agent scaffolding, not raw prompting. The scaffold wraps the model with file-reading tools, command execution, patch application, and test verification. Qwen3-Coder-Next's training specifically targeted this interaction pattern — the model knows how to use tools, handle failed commands, recover from incorrect patches, and break multi-step problems into verifiable sub-tasks.
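For a sense of what the scaffold is doing, here is a minimal sketch of that loop against Ollama's OpenAI-compatible endpoint. The single `run_command` tool and the loop cap are illustrative simplifications, not SWE-Agent's actual tool schema; real scaffolds add sandboxing, patch review, and cost budgets, and tool-call support depends on your Ollama version:

```python
# Minimal agent loop: the model requests tool calls, we execute them,
# and feed the output back until it stops asking for tools.
import json
import subprocess
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_command",
        "description": "Run a shell command in the repo and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"cmd": {"type": "string"}},
            "required": ["cmd"],
        },
    },
}]

messages = [{"role": "user", "content": "Run the test suite and fix the first failure."}]
for _ in range(10):  # hard cap; real scaffolds use budgets and patch review
    resp = client.chat.completions.create(
        model="qwen3-coder", messages=messages, tools=TOOLS
    )
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:
        print(msg.content)  # the model considers itself done
        break
    for call in msg.tool_calls:
        cmd = json.loads(call.function.arguments)["cmd"]
        result = subprocess.run(cmd, shell=True, capture_output=True,
                                text=True, timeout=120)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": (result.stdout + result.stderr)[-4000:],  # truncate
        })
```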
Open-source scaffolds that work with local models via OpenAI-compatible endpoints include SWE-Agent, OpenHands (formerly OpenDevin), Aider, and Continue.dev. All support pointing at a local Ollama or llama-server endpoint with a simple config change. The model handles the tool-use patterns natively — no additional prompt engineering or fine-tuning is required beyond the system prompt each scaffold provides.
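As one concrete example, pointing Aider at a local Ollama endpoint typically looks like this (variable and flag names follow Aider's OpenAI-compatible setup; verify against your installed version):

```bash
# Route Aider through the local Ollama server instead of a cloud API.
export OPENAI_API_BASE=http://localhost:11434/v1
export OPENAI_API_KEY=local   # Ollama ignores the key, but clients require one
aider --model openai/qwen3-coder
```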
The Qwen3-Coder family has two primary tiers: Coder-Next (80B total / 3B active) and the larger Coder-480B-A35B (480B total / 35B active). The 480B is the frontier model — it achieves 61.8% on Aider Polyglot and rivals closed API models from major labs on the hardest coding benchmarks. The hardware requirement scales accordingly: at Q4, the 480B needs roughly 270 GB for the weights alone, which is server territory. If your machine tops out around 64 GB, Coder-Next is the realistic choice; reach for the 480B only when you need frontier-tier results and have the memory to match.
Twelve months ago, running a competitive coding agent locally meant accepting a meaningful quality gap versus cloud APIs. The best locally-runnable models scored in the mid-20% range on SWE-Bench Verified. Qwen3-Coder-Next at 70%+ on the same benchmark, on a consumer MacBook, is not an incremental improvement — it is a different category of result.
The data privacy case for local coding has also never been more concrete. When your coding agent runs locally, your entire codebase, every prompt, every file the model reads — none of it leaves your machine. No training data concerns. No API usage logs to worry about. No keys to rotate or rate limits to hit at 2am. For teams on proprietary codebases, this is now achievable without meaningful quality trade-offs for the majority of real-world coding tasks.
Find out which quant of Qwen3-Coder-Next fits your exact hardware — at your target context length — before you commit to the download.
Check your hardware in the VRAM Calculator →