Alibaba released Qwen3.5-9B on March 2, 2026, and the local AI community initially looked right past it. Everyone was focused on the flagship 397B model and the 122B-A10B MoE — the poster children for raw capability. The 9B sat quietly in the same release. Then, a few weeks later, the benchmark comparisons started circulating. A 9B model outperforming OpenAI's GPT-OSS-120B — a model thirteen times its size — on three out of four key reasoning and multimodal benchmarks. Running on a laptop. With 16GB of regular RAM. No GPU required.
GPT-OSS-120B is OpenAI's open-source model intended to compete with frontier proprietary systems. At 120 billion parameters, it represents a substantial investment in model scale. The assumption has always been that more parameters means more capability. Qwen3.5-9B challenges that assumption directly on the benchmarks that matter most for practical intelligence.
GPQA Diamond is the benchmark worth pausing on. It tests doctoral-level questions in biology, chemistry, and physics — questions constructed to resist pattern-matching and require genuine domain reasoning. A 9B model scoring 81.7 here is not a fluke. It reflects a genuine architectural advance in how the model represents and retrieves scientific knowledge.
Qwen3.5's small series (0.8B through 9B) is not a scaled-down version of the flagship model. Alibaba redesigned it from scratch for efficiency at small scales. The key breakthrough is a hybrid architecture combining Gated Delta Networks with sparse Mixture-of-Experts — neither of which is a standard transformer block.
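To make that concrete, here is a minimal sketch of a gated delta rule update, the mechanism a GDN layer is built around. Everything below is illustrative: the dimension, gate values, and function names are assumptions, not Qwen3.5's actual implementation.

```python
import numpy as np

d = 64                # head dimension (illustrative)
S = np.zeros((d, d))  # recurrent state: fixed size, independent of sequence length

def gated_delta_step(S, k, v, q, alpha, beta):
    """One token step: alpha is a decay gate in (0,1), beta a write strength."""
    k = k / np.linalg.norm(k)  # work with a unit-norm key
    # Gated delta rule: decay the old state, erase the association previously
    # stored under k, then write the new key-value association.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    o = S @ q  # read out with the query
    return S, o

rng = np.random.default_rng(0)
for _ in range(1000):  # stand-in for a long sequence
    k, v, q = rng.normal(size=(3, d))
    S, o = gated_delta_step(S, k, v, q, alpha=0.98, beta=0.5)

print(S.shape)  # (64, 64) no matter how many tokens were processed
```

The detail that matters is the shape of `S`: the state a GDN layer carries forward is the same size after one token or a million.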
If you've run out of context on a 7B model during a long document conversation, standard attention's KV cache is the reason — it grows with every token, competing with the model weights for VRAM. GDN's recurrent component keeps the effective cache size bounded, which is why Qwen3.5-9B can hold 262K tokens in context on hardware that would OOM at 32K with a comparable transformer.
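The difference is easy to see with back-of-envelope arithmetic. The configuration below is hypothetical (the exact Qwen3.5-9B layer and head counts aren't given here); what matters is the scaling behavior, not the exact numbers.

```python
layers, kv_heads, head_dim = 36, 8, 128  # hypothetical config, not Qwen3.5-9B's published one
bytes_per_value = 2                      # fp16 cache entries

def kv_cache_gb(tokens: int) -> float:
    # one K and one V tensor per layer, per token
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9

print(f"attention KV cache @  32K tokens: {kv_cache_gb(32_768):.1f} GB")   # ~4.8 GB
print(f"attention KV cache @ 262K tokens: {kv_cache_gb(262_144):.1f} GB")  # ~38.7 GB

# A recurrent GDN layer instead carries a fixed-size state per head:
# memory that does not grow with the number of tokens processed.
state_gb = layers * kv_heads * head_dim * head_dim * bytes_per_value / 1e9
print(f"recurrent state, all layers:      {state_gb:.3f} GB")              # ~0.009 GB
```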
The 9B model needs approximately 6GB of VRAM at Q4 quantization — or 12-16GB of regular system RAM for CPU-only inference. That covers the majority of hardware people already own.
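The 6GB figure follows from simple arithmetic; the bits-per-weight value below is a typical Q4-class average and an assumption, not a published spec.

```python
params = 9e9
bits_per_weight = 4.5  # typical effective rate for Q4-class quants (an assumption)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"quantized weights: {weights_gb:.1f} GB")  # ~5.1 GB; cache and runtime
                                                  # overhead push it toward 6 GB
```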
For CPU-only inference, the 9B runs at approximately 8-15 tokens per second on a modern Intel Core i7 or AMD Ryzen 7. Slow for real-time chat, but fully adequate for batch work: summarization, translation, code review, offline document Q&A. With an RTX 4060 (8 GB), expect 55-65 tok/s. On Apple M2 Pro unified memory (16 GB), around 25-30 tok/s.
Getting started takes a few commands:

```bash
# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the 9B model (~5.8 GB download)
ollama pull qwen3.5:9b

# Start chatting
ollama run qwen3.5:9b

# CPU-only inference (no GPU required; slower but works)
OLLAMA_NUM_GPU=0 ollama run qwen3.5:9b

# Smaller variants for tighter hardware
ollama pull qwen3.5:4b    # fits any 8GB VRAM card
ollama pull qwen3.5:2b    # fits almost any device
ollama pull qwen3.5:0.8b  # embedded / edge use
```

Ollama exposes an OpenAI-compatible API on localhost, so batch workflows like offline summarization are a short Python script:

```python
from openai import OpenAI
import pathlib

# Ollama's OpenAI-compatible endpoint; the api_key is required by the
# client but ignored by Ollama
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',
)

def summarize_document(path: str) -> str:
    text = pathlib.Path(path).read_text()
    resp = client.chat.completions.create(
        model='qwen3.5:9b',
        messages=[
            {
                'role': 'system',
                'content': 'Summarize documents concisely. Preserve key numbers, dates, and decisions.',
            },
            {
                'role': 'user',
                'content': f'Summarize this document:\n\n{text}',
            },
        ],
        temperature=0.2,  # low temperature keeps summaries stable and factual
        max_tokens=1024,
    )
    return resp.choices[0].message.content

for doc in pathlib.Path('./documents').glob('*.txt'):
    print(f'--- {doc.name} ---')
    print(summarize_document(str(doc)))
```

The 262K context window is the 9B's biggest practical advantage over every other small model. A standard business contract is 5,000-15,000 tokens. A detailed technical spec runs 30,000-80,000. Qwen3.5-9B ingests all of that in a single pass: no chunking, no lost context between segments.
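For batch pipelines it can still be worth checking whether a file fits before sending it. A character-count heuristic is enough for a go/no-go decision; the 4-characters-per-token ratio below is a rough English-text assumption, not Qwen's actual tokenizer, and `spec.txt` is a hypothetical file.

```python
import pathlib

CONTEXT_LIMIT = 262_144  # Qwen3.5-9B's advertised window

def fits_in_context(path: str, reserve_for_output: int = 2048) -> bool:
    # ~4 characters per token is a rough heuristic for English text
    est_tokens = len(pathlib.Path(path).read_text()) // 4
    return est_tokens + reserve_for_output <= CONTEXT_LIMIT

print(fits_in_context('./documents/spec.txt'))  # hypothetical file
```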
The Qwen3.5-9B result fits a pattern building throughout 2026: the models winning benchmarks are doing it through architectural innovation, not raw parameter counts. Gated Delta Networks, sparse MoE at small scale, hybrid attention: these represent a genuine change in how models use their parameter budgets. Hardware that felt genuinely limited two years ago now runs models that outperform rivals thirteen times their size on doctoral-level reasoning. The gap between "what runs on your machine" and "what frontier APIs offer" is closing faster than almost anyone predicted.
See which Qwen3.5 model fits your GPU — and how it compares to every other local model for your exact hardware.
Open the VRAM Calculator →