
Qwen3.5-9B Beats OpenAI's 120B Model — While Running on Your Laptop

Qwen3.5-9B runs on any modern laptop with 16GB RAM — no dedicated GPU required.

Alibaba released Qwen3.5-9B on March 2, 2026, and the local AI community initially looked right past it. Everyone was focused on the flagship 397B model and the 122B-A10B MoE — the poster children for raw capability. The 9B sat quietly in the same release. Then, a few weeks later, the benchmark comparisons started circulating: a 9B model outperforming OpenAI's GPT-OSS-120B — a model thirteen times its size — across four key reasoning and multimodal benchmarks. Running on a laptop. With 16 GB of regular RAM. No GPU required.

The Numbers That Stopped the Community Cold

GPT-OSS-120B is OpenAI's open-weight model intended to compete with frontier proprietary systems. At 120 billion parameters, it represents a substantial investment in model scale. The assumption has always been that more parameters mean more capability. Qwen3.5-9B challenges that assumption directly on the benchmarks that matter most for practical intelligence.

Qwen3.5-9B vs GPT-OSS-120B — Key Benchmark Scores

  • GPQA Diamond (doctoral-level science reasoning): 81.7 vs 80.1 — Qwen3.5-9B by 1.6 pts
  • MMLU-Pro (professional-level knowledge breadth): 82.5 vs 80.8 — Qwen3.5-9B by 1.7 pts
  • MMMU-Pro (multimodal visual reasoning): 70.1 vs 59.7 — Qwen3.5-9B by 10.4 pts
  • MMMLU (multilingual knowledge and understanding): 81.2 vs 78.2 — Qwen3.5-9B by 3.0 pts

GPQA Diamond is the benchmark worth pausing on. It tests doctoral-level questions in biology, chemistry, and physics — questions constructed to resist pattern-matching and require genuine domain reasoning. A 9B model scoring 81.7 here is not a fluke. It reflects a genuine architectural advance in how the model represents and retrieves scientific knowledge.

What Changed in the Architecture

Qwen3.5's small series (0.8B through 9B) is not a scaled-down version of the flagship model. Alibaba redesigned it from scratch for efficiency at small scales. The key breakthrough is a hybrid architecture combining Gated Delta Networks with sparse Mixture-of-Experts — neither of which is a standard transformer block.

  • Gated Delta Networks (GDN) — A recurrent mechanism that carries long-range context in a fixed-size state, so per-token memory stays constant where standard self-attention's KV cache grows with every token (and its total compute grows quadratically with sequence length). This is what makes the 262K context window practical without a 40GB GPU.
  • Sparse MoE at small scale — Only a subset of weights activates per token. The effective compute per inference step is lower than a 9B dense model while total parameter capacity is preserved.
  • 262,144-token native context window — Most 7B-9B models cap at 8K-32K. The GDN hybrid component is what makes this feasible on consumer RAM.
  • 201 languages and dialects — Trained with balanced multilingual data. The MMMLU benchmark result (81.2) confirms it holds quality across low-resource languages.
  • Apache 2.0 license — Fully commercial, modify and deploy freely.

If you've run out of context on a 7B model during a long document conversation, standard attention's KV cache is the reason — it grows with every token, competing with the model weights for VRAM. GDN's recurrent component keeps the effective cache size bounded, which is why Qwen3.5-9B can hold 262K tokens in context on hardware that would OOM at 32K with a comparable transformer.
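
To make that concrete, here is a toy NumPy sketch of a delta-rule-style recurrent update, loosely following the gated delta rule from the DeltaNet literature — an illustration of the mechanism's shape, not Qwen's actual implementation. The dimensions, gate values, and variable names are all invented for the example. The point is that the state is one fixed d×d matrix per head no matter how many tokens stream through, while a KV cache grows by two d-vectors every token.

import numpy as np

d = 64          # head dimension (illustrative)
seq_len = 4096  # tokens processed

S = np.zeros((d, d))  # recurrent state: fixed size, independent of seq_len
rng = np.random.default_rng(0)

for t in range(seq_len):
    k = rng.standard_normal(d)
    k /= np.linalg.norm(k)      # key, unit-normalized
    v = rng.standard_normal(d)  # value
    alpha, beta = 0.99, 0.5     # decay gate / write strength (made-up constants)

    # Gated delta rule: decay the old state, erase what this key currently
    # reads out, then write the new value at the key's address.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)

    o = S @ rng.standard_normal(d)  # read-out for this token's query

kv_bytes = seq_len * 2 * d * 4      # fp32 K and V vectors per token
print(f'recurrent state: {S.nbytes / 1024:.0f} KiB (constant)')
print(f'KV cache at {seq_len:,} tokens: {kv_bytes / 1024:.0f} KiB (grows with n)')

Scale that KV-cache line to 262K tokens across every layer and head of a 9B transformer and the cache alone rivals the quantized weights; the recurrent-state line does not move.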

The Hardware Requirements Are Genuinely Accessible

The 9B model needs approximately 6 GB of VRAM at Q4 quantization — or 12-16 GB of regular system RAM for CPU-only inference. That covers the majority of hardware people already own. (The arithmetic behind these figures is sketched after the list below.)

Qwen3.5 Small Series — VRAM Required at Q4 Quantization
  • Qwen3.5-0.8B — Runs on a Raspberry Pi, phone, or embedded device. ~0.8 GB VRAM or 2 GB RAM.
  • Qwen3.5-2B — Smart speaker class hardware. ~1.6 GB VRAM or 4 GB RAM. Full 201-language support.
  • Qwen3.5-4B — Any laptop with 8 GB RAM. Basic entry-level GPU. Competitive with many 7B models.
  • Qwen3.5-9B — The benchmark champion. 6 GB VRAM on an RTX 4060, or 12-16 GB system RAM for CPU inference.
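
The VRAM figures above fall out of simple arithmetic: parameter count times bits per weight, plus headroom for activations and runtime buffers. Here is a rough back-of-envelope estimator — the 4.5 bits/weight and 1.2× overhead factor are rule-of-thumb assumptions, not measured constants, and the smallest models skew higher because embeddings and buffers don't shrink proportionally:

def estimate_vram_gb(params_billion: float, bits_per_weight: float = 4.5,
                     overhead: float = 1.2) -> float:
    # Weights at the quantized bit width, padded for activations and buffers.
    weight_gb = params_billion * bits_per_weight / 8
    return weight_gb * overhead

for name, params in [('0.8B', 0.8), ('2B', 2.0), ('4B', 4.0), ('9B', 9.0)]:
    print(f'Qwen3.5-{name}: ~{estimate_vram_gb(params):.1f} GB at Q4')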

For CPU-only inference, the 9B runs at approximately 8-15 tokens per second on a modern Intel Core i7 or AMD Ryzen 7. Slow for real-time chat, but fully adequate for batch work: summarization, translation, code review, offline document Q&A. With an RTX 4060 (8 GB), expect 55-65 tok/s. On Apple M2 Pro unified memory (16 GB), around 25-30 tok/s.
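
Those rates translate directly into wall-clock planning for batch jobs. A quick estimate — the document count and summary length are illustrative assumptions, and prompt-processing (prefill) time is ignored, which matters for very long inputs on CPU:

def batch_hours(n_docs: int, summary_tokens: int, tok_per_sec: float) -> float:
    # Generation time only; prefill of long prompts adds more, especially on CPU.
    return n_docs * summary_tokens / tok_per_sec / 3600

print(f'CPU @ 10 tok/s:      {batch_hours(200, 500, 10):.1f} h')   # ~2.8 h — overnight-easy
print(f'RTX 4060 @ 58 tok/s: {batch_hours(200, 500, 58):.1f} h')   # ~0.5 h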

How to Run It Today with Ollama

In a terminal:
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the 9B model (~5.8 GB download)
ollama pull qwen3.5:9b

# Start chatting
ollama run qwen3.5:9b

# CPU-only inference (no GPU required — slower but works)
OLLAMA_NUM_GPU=0 ollama run qwen3.5:9b

# Smaller variants for tighter hardware
ollama pull qwen3.5:4b   # fits any 8GB VRAM card
ollama pull qwen3.5:2b   # fits almost any device
ollama pull qwen3.5:0.8b # embedded / edge use

Ollama also exposes an OpenAI-compatible API on localhost, which makes batch work a short Python script — summarize.py:

from openai import OpenAI
import pathlib

# Ollama's /v1 endpoint speaks the OpenAI wire format; the api_key
# is required by the client but ignored by the local server.
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)

def summarize_document(path: str) -> str:
    text = pathlib.Path(path).read_text()
    resp = client.chat.completions.create(
        model='qwen3.5:9b',
        messages=[
            {
                'role': 'system',
                'content': 'Summarize documents concisely. Preserve key numbers, dates, and decisions.',
            },
            {
                'role': 'user',
                'content': f'Summarize this document:\n\n{text}',
            },
        ],
        temperature=0.2,
        max_tokens=1024,
    )
    return resp.choices[0].message.content

for doc in pathlib.Path('./documents').glob('*.txt'):
    print(f'--- {doc.name} ---')
    print(summarize_document(str(doc)))
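
Drop .txt files into a documents/ folder next to the script and run it with the Ollama server up (ollama serve, or the desktop app, which keeps the server running in the background). Files are processed sequentially, so the batch-time arithmetic above applies directly.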

What the 9B Actually Handles Well

  • ✅ Long document Q&A and summarization — The 262K context is the decisive edge. Feed entire research papers, contracts, or transcripts without chunking.
  • ✅ Multilingual work — Translation and cross-lingual Q&A across 201 languages, including low-resource ones most models degrade on.
  • ✅ Scientific and technical Q&A — The 81.7 GPQA Diamond score reflects genuine scientific reasoning, not pattern-matching on memorized answers.
  • ✅ Code explanation and debugging — Competitive on HumanEval for explanation, debugging, and short generation tasks.
  • ✅ Offline private document analysis — Sensitive files never leave the machine. Zero API cost.
  • ⚠️ Extended chain-of-thought math — On very hard multi-step derivations, the 27B dense or 35B-A3B MoE has a clear edge.
  • ⚠️ Large codebase agentic tasks — For multi-file edits across a full repository, Qwen3.5-27B is the better pick.

The 262K context window is the 9B's biggest practical advantage over every other small model. A standard business contract is 5,000-15,000 tokens. A detailed technical spec runs 30,000-80,000. Qwen3.5-9B ingests all of that in a single pass — no chunking, no lost context between segments.
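
Whether a given file actually fits is worth checking before you send it. A crude pre-flight estimate — the four-characters-per-token heuristic is an English-text approximation that varies by language and tokenizer:

import pathlib

CONTEXT_TOKENS = 262_144
CHARS_PER_TOKEN = 4  # rough English-text heuristic; varies by tokenizer

def fits_in_context(path: str, reserve_for_output: int = 2048) -> bool:
    # Leave room for the system prompt and the model's reply.
    est_tokens = len(pathlib.Path(path).read_text()) / CHARS_PER_TOKEN
    return est_tokens < CONTEXT_TOKENS - reserve_for_output

for doc in pathlib.Path('./documents').glob('*.txt'):
    print(f"{doc.name}: {'fits' if fits_in_context(str(doc)) else 'needs chunking'}")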

Tokens Per Second on Real Hardware

Qwen3.5-9B Q4_K_M — Estimated Tokens / Second by Hardware

  • RTX 4090 (24 GB) — ~88 tok/s
  • RTX 4070 Ti (16 GB) — ~72 tok/s
  • RTX 4060 (8 GB) — ~58 tok/s
  • Apple M3 Max (64 GB) — ~68 tok/s
  • Apple M2 Pro (16 GB) — ~28 tok/s
  • CPU only (Ryzen 9 / Core i9) — ~13 tok/s

The Bigger Pattern: Efficiency Is 2026's Real Story

The Qwen3.5-9B result fits a pattern building throughout 2026: the models winning benchmarks are doing it through architectural innovation, not raw parameter counts. Gated Delta Networks, sparse MoE at small scale, hybrid attention — these represent a genuine change in how models use their parameter budgets. Hardware that felt genuinely limited two years ago is now capable of outperforming models thirteen times its size on doctoral-level reasoning. The gap between "what runs on your machine" and "what frontier APIs offer" is closing faster than almost anyone predicted.

See which Qwen3.5 model fits your GPU — and how it compares to every other local model for your exact hardware.

Open the VRAM Calculator →
