Alibaba released Qwen3.5-9B on March 2, 2026, and the local AI community initially looked right past it. Everyone was focused on the flagship 397B model and the 122B-A10B MoE — the poster children for raw capability. The 9B sat quietly in the same release. Then, a few weeks later, the benchmark comparisons started circulating. A 9B model outperforming OpenAI's GPT-OSS-120B — a model thirteen times its size — on three out of four key reasoning and multimodal benchmarks. Running on a laptop. With 16GB of regular RAM. No GPU required.
GPT-OSS-120B is OpenAI's open-source model intended to compete with frontier proprietary systems. At 120 billion parameters, it represents a substantial investment in model scale. The assumption has always been that more parameters means more capability. Qwen3.5-9B challenges that assumption directly on the benchmarks that matter most for practical intelligence.
GPQA Diamond is the benchmark worth pausing on. It tests doctoral-level questions in biology, chemistry, and physics — questions constructed to resist pattern-matching and require genuine domain reasoning. A 9B model scoring 81.7 here is not a fluke. It reflects a genuine architectural advance in how the model represents and retrieves scientific knowledge.
Qwen3.5's small series (0.8B through 9B) is not a scaled-down version of the flagship model. Alibaba redesigned it from scratch for efficiency at small scales. The key breakthrough is a hybrid architecture combining Gated Delta Networks with sparse Mixture-of-Experts — neither of which is a standard transformer block.
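To make that concrete, here is a minimal sketch of a gated delta rule update, the mechanism a GDN layer is built around. Everything below is illustrative: the dimension, gate values, and function names are assumptions, not Qwen3.5's actual implementation.

```python
import numpy as np

d = 64                # head dimension (illustrative)
S = np.zeros((d, d))  # recurrent state: fixed size, independent of sequence length

def gated_delta_step(S, k, v, q, alpha, beta):
    """One token step: alpha is a decay gate in (0,1), beta a write strength."""
    k = k / np.linalg.norm(k)  # work with a unit-norm key
    # Gated delta rule: decay the old state, erase the association previously
    # stored under k, then write the new key-value association.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    o = S @ q  # read out with the query
    return S, o

rng = np.random.default_rng(0)
for _ in range(1000):  # stand-in for a long sequence
    k, v, q = rng.normal(size=(3, d))
    S, o = gated_delta_step(S, k, v, q, alpha=0.98, beta=0.5)

print(S.shape)  # (64, 64) no matter how many tokens were processed
```

The detail that matters is the shape of `S`: the state a GDN layer carries forward is the same size after one token or a million.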
If you've run out of context on a 7B model during a long document conversation, standard attention's KV cache is the reason — it grows with every token, competing with the model weights for VRAM. GDN's recurrent component keeps the effective cache size bounded, which is why Qwen3.5-9B can hold 262K tokens in context on hardware that would OOM at 32K with a comparable transformer.
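The difference is easy to see with back-of-envelope arithmetic. The configuration below is hypothetical (the exact Qwen3.5-9B layer and head counts aren't given here); what matters is the scaling behavior, not the exact numbers.

```python
layers, kv_heads, head_dim = 36, 8, 128  # hypothetical config, not Qwen3.5-9B's published one
bytes_per_value = 2                      # fp16 cache entries

def kv_cache_gb(tokens: int) -> float:
    # one K and one V tensor per layer, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

print(f"attention KV cache @  32K tokens: {kv_cache_gb(32_768):.1f} GB")   # ~4.8 GB
print(f"attention KV cache @ 262K tokens: {kv_cache_gb(262_144):.1f} GB")  # ~38.7 GB

# A recurrent GDN layer instead carries a fixed-size state per head:
# memory that does not grow with the number of tokens processed.
state_gb = layers * kv_heads * head_dim * head_dim * bytes_per_value / 1e9
print(f"recurrent state, all layers:      {state_gb:.3f} GB")              # ~0.009 GB
```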
The 9B model needs approximately 6GB of VRAM at Q4 quantization — or 12-16GB of regular system RAM for CPU-only inference. That covers the majority of hardware people already own.
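The 6GB figure follows from simple arithmetic; the bits-per-weight value below is a typical Q4-class average and an assumption, not a published spec.

```python
params = 9e9
bits_per_weight = 4.5  # typical effective rate for Q4-class quants (an assumption)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"quantized weights: {weights_gb:.1f} GB")  # ~5.1 GB; cache and runtime
                                                  # overhead push it toward 6 GB
```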
For CPU-only inference, the 9B runs at approximately 8-15 tokens per second on a modern Intel Core i7 or AMD Ryzen 7. Slow for real-time chat, but fully adequate for batch work: summarization, translation, code review, offline document Q&A. With an RTX 4060 (8 GB), expect 55-65 tok/s. On Apple M2 Pro unified memory (16 GB), around 25-30 tok/s.
Getting started takes a few commands:

```bash
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the 9B model (~5.8 GB download)
ollama pull qwen3.5:9b

# Start chatting
ollama run qwen3.5:9b

# CPU-only inference (no GPU required; slower but works)
OLLAMA_NUM_GPU=0 ollama run qwen3.5:9b

# Smaller variants for tighter hardware
ollama pull qwen3.5:4b    # fits any 8GB VRAM card
ollama pull qwen3.5:2b    # fits almost any device
ollama pull qwen3.5:0.8b  # embedded / edge use
```

Ollama exposes an OpenAI-compatible API on localhost, so batch workflows like offline summarization are a short Python script:

```python
from openai import OpenAI
import pathlib

# Ollama's OpenAI-compatible endpoint; the api_key is required by the
# client but ignored by Ollama
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)

def summarize_document(path: str) -> str:
    text = pathlib.Path(path).read_text()
    resp = client.chat.completions.create(
        model='qwen3.5:9b',
        messages=[
            {
                'role': 'system',
                'content': 'Summarize documents concisely. Preserve key numbers, dates, and decisions.',
            },
            {
                'role': 'user',
                'content': f'Summarize this document:\n\n{text}',
            },
        ],
        temperature=0.2,  # low temperature keeps summaries stable and factual
        max_tokens=1024,
    )
    return resp.choices[0].message.content

for doc in pathlib.Path('./documents').glob('*.txt'):
    print(f'--- {doc.name} ---')
    print(summarize_document(str(doc)))
```

The 262K context window is the 9B's biggest practical advantage over every other small model. A standard business contract is 5,000-15,000 tokens. A detailed technical spec runs 30,000-80,000. Qwen3.5-9B ingests all of that in a single pass: no chunking, no lost context between segments.
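For batch pipelines it can still be worth checking whether a file fits before sending it. A character-count heuristic is enough for a go/no-go decision; the 4-characters-per-token ratio below is a rough English-text assumption, not Qwen's actual tokenizer, and `spec.txt` is a hypothetical file.

```python
import pathlib

CONTEXT_LIMIT = 262_144  # Qwen3.5-9B's advertised window

def fits_in_context(path: str, reserve_for_output: int = 2048) -> bool:
    # ~4 characters per token is a rough heuristic for English text
    est_tokens = len(pathlib.Path(path).read_text()) // 4
    return est_tokens + reserve_for_output <= CONTEXT_LIMIT

print(fits_in_context('./documents/spec.txt'))  # hypothetical file
```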
The Qwen3.5-9B result fits a pattern building throughout 2026: the models winning benchmarks are doing it through architectural innovation, not raw parameter counts. Gated Delta Networks, sparse MoE at small scale, hybrid attention: these represent a genuine change in how models use their parameter budgets. Hardware that felt genuinely limited two years ago now runs models that outperform rivals thirteen times their size on doctoral-level reasoning. The gap between "what runs on your machine" and "what frontier APIs offer" is closing faster than almost anyone predicted.
See which Qwen3.5 model fits your GPU — and how it compares to every other local model for your exact hardware.
Open the VRAM Calculator →