There's a number circulating in the local AI community that stops people mid-scroll: a 122-billion-parameter model generating 60 tokens per second on battery, in a 14-inch laptop, using 73.8 GB of unified memory. That number belongs to Qwen3.5-122B-A10B on the Apple M5 Max. It's not a cherry-picked server benchmark — it's a measured result at 16K context on consumer hardware. Understanding why this is possible, and how to replicate it on your hardware, is what this post is about.
For reference: a dense 13B model on an RTX 4090 typically posts 50-70 tok/s. A 70B dense model at Q4 on the same card does around 20-30 tok/s. The relationship between model size and inference speed has always been roughly linear — more parameters means more data to process per token.
Qwen3.5-122B-A10B breaks that relationship. The "A10B" suffix means only 10 billion parameters are active on each forward pass. The remaining 112 billion sit loaded in memory but dormant for any given token. Inference throughput correlates with active parameter count, not stored size. You get the knowledge and reasoning depth of 122B worth of trained weights, generating tokens at roughly the speed of a dense 10B model.
On the M5 Max with 128 GB of unified memory and Apple's Metal-accelerated inference path, this translates to 60.6 tok/s at 16K context — or up to 65 tok/s at shorter context lengths. Fast enough to feel like real-time conversation, capable enough to compete with frontier cloud models from 18 months ago.
A standard dense transformer applies every layer to every token. Every weight participates in every forward pass. Mixture-of-Experts replaces some feed-forward layers with a collection of specialized sub-networks — "experts" — plus a learned router that selects which experts to activate for each token.
In Qwen3.5, this routing is sparse: at each MoE layer, only a small fixed number of experts fire per token. The rest are bypassed entirely — no compute, no memory bandwidth consumed for those weights during that pass. With 122B total parameters but only 10B active, roughly 92% of weights are skipped on any given token. This is why inference speed tracks active parameters, not total model size.
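The routing mechanics can be sketched in a few lines of numpy. This is a toy illustration, not Qwen's actual router: the expert count, dimensions, and top_k here are made up, and production MoE layers add load-balancing losses and fused kernels on top.

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route one token through top_k of n experts; the rest do no work.

    x: (d,) token hidden state
    router_w: (n_experts, d) learned router weights (illustrative shapes)
    experts: list of callables, one per expert feed-forward network
    """
    logits = router_w @ x                    # one score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only the selected experts execute -- the skipped ones consume
    # no compute and no memory bandwidth for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy demo: 8 experts, only 2 fire per token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda x, W=rng.normal(size=(d, d)) / d: W @ x for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
y = moe_forward(rng.normal(size=d), router_w, experts)
print(y.shape)
```

With top_k fixed, per-token work stays constant no matter how many experts the model stores, which is exactly why 122B total parameters can decode at dense-10B speeds.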
The second architectural component is linear attention, which replaces standard softmax self-attention in a subset of layers. Standard attention's memory cost grows quadratically with sequence length: a 128K context implies roughly 128K × 128K attention scores per head, and the KV cache grows with every token. Linear attention approximates this with a fixed-size recurrent state, so the cache stays bounded no matter how long the conversation grows. Longer contexts don't blow up your memory the way they do in a standard transformer.
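A minimal sketch of the idea, assuming a simple kernelized formulation (the feature map and dimensions here are illustrative; Qwen3.5's exact linear-attention variant is not public at this level of detail):

```python
import numpy as np

def linear_attention_step(state, z, q, k, v, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """One decode step of kernelized linear attention.

    state: (d_k, d_v) running sum of phi(k) v^T -- fixed size,
           independent of how many tokens came before
    z:     (d_k,)     running sum of phi(k), used as a normalizer
    """
    fk = phi(k)
    state = state + np.outer(fk, v)               # fold this token into the state
    z = z + fk
    out = (phi(q) @ state) / (phi(q) @ z + 1e-6)  # attend via the compressed state
    return out, state, z

# The state stays (d_k, d_v) no matter how many tokens we process:
d_k, d_v = 8, 8
state, z = np.zeros((d_k, d_v)), np.zeros(d_k)
rng = np.random.default_rng(1)
for _ in range(1000):  # 1000 tokens; memory footprint never grows
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    out, state, z = linear_attention_step(state, z, q, k, v)
print(state.shape)
```

The trade is exactness for boundedness: the fixed-size state is a lossy summary of the history, which is why these models interleave linear layers with standard attention layers rather than replacing attention wholesale.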
Together, sparse experts plus linear attention produce what Alibaba reports as 8-19x faster decoding versus the previous Qwen3-Max generation. The performance gain is architectural — fewer operations per token and a tighter memory footprint, not just faster hardware.
On Apple Silicon, unified memory means the entire RAM pool is available for model weights. The M5 Max with 128 GB reports 73.8 GB peak usage for Qwen3.5-122B-A10B-4bit at 16K context — leaving 54 GB for the OS, system processes, and additional context headroom. The M3/M4 Max at 96 GB fits it too, but cap your context at 32K to stay safe.
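The arithmetic behind that fit is simple back-of-envelope math. In this sketch the 10% overhead factor is an assumption covering embeddings and norm layers kept at higher precision:

```python
def weights_gb(params_b, bits=4, overhead=1.1):
    """Rough quantized-weight footprint in GB. The overhead factor is an
    assumed fudge for tensors stored above 4-bit (embeddings, norms)."""
    return params_b * 1e9 * bits / 8 / 1e9 * overhead

print(f"122B @ 4-bit: ~{weights_gb(122):.0f} GB of weights")
# The reported 73.8 GB peak adds the KV cache and runtime buffers at 16K context.
print(f"Headroom on a 128 GB machine: {128 - 73.8:.1f} GB")
```

Weights alone land in the mid-60s of GB; the gap up to the measured 73.8 GB peak is context-dependent, which is why the 96 GB machines need the tighter 32K context cap.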
The Qwen3.5 family spans from a compact 7B to the frontier-scale 397B-A17B. Here's how each variant maps to local hardware tiers:
The MoE advantage is visible throughout this chart. The 35B-A3B on a 64 GB M4 Max posts 44 tok/s — nearly as fast as the 14B dense model on the M4 Pro, despite being 2.5x larger in total parameters. At the top, the 122B-A10B on M5 Max at 61 tok/s is 2.5x faster than the dense 72B at the same memory tier. Active parameters drive throughput, not stored ones.
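Why active parameters dominate: decode on unified memory is largely bandwidth-bound, because every active weight must be streamed from RAM once per generated token. A rough ceiling can be computed as bandwidth divided by active bytes per token. The 500 GB/s bandwidth figure below is an assumed round number, not an official M5 Max spec:

```python
def decode_ceiling_toks(active_params_b, bits, mem_bw_gbs):
    """Bandwidth-bound decode ceiling: each active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return mem_bw_gbs * 1e9 / bytes_per_token

# Hypothetical 500 GB/s memory bandwidth:
for name, active_b in [("122B-A10B (10B active)", 10), ("dense 72B", 72)]:
    print(f"{name}: ~{decode_ceiling_toks(active_b, 4, 500):.0f} tok/s ceiling")
```

The MoE model's theoretical ceiling sits well above its measured 61 tok/s (kernel overhead and the attention layers absorb the rest), while the dense 72B's ceiling is an order of magnitude lower at the same bandwidth.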
Ollama's Metal backend handles hardware detection automatically on Apple Silicon. There's nothing platform-specific to configure — it detects unified memory, routes inference through Metal, and uses the full pool. The commands are identical to any other platform.
# Check your unified memory
system_profiler SPHardwareDataType | grep "Memory:"
# Pull and run the 14B for 24 GB+ setups
ollama pull qwen3.5:14b
ollama run qwen3.5:14b
# Pull and run the 122B MoE for M5 Max 128 GB
ollama pull qwen3.5:122b-a10b
ollama run qwen3.5:122b-a10b
# Monitor GPU activity while model is loaded
sudo powermetrics --samplers gpu_power -i 1000 -n 5
# Test via Ollama's native chat API
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5:14b","messages":[{"role":"user","content":"/nothink Explain MoE briefly."}]}'

For users who want every last tok/s, Apple's MLX framework offers an alternative inference path tuned specifically for the M-series chips. The mlx-lm package runs Qwen3.5 through Apple's own Metal compute shaders. In some configurations this yields 5-15% higher throughput versus llama.cpp — the gap narrows at Q4, where memory bandwidth dominates over compute efficiency — but it's worth testing for server or continuous workloads.
# Install MLX LM
pip install mlx-lm
# Run Qwen3.5-14B via MLX (auto-downloads 4-bit weights from Hugging Face)
mlx_lm.generate \
--model mlx-community/Qwen3.5-14B-Instruct-4bit \
--prompt "/nothink What is the difference between MoE and a dense model?" \
--max-tokens 300
# For the 122B MoE on M5 Max 128 GB
mlx_lm.generate \
--model mlx-community/Qwen3.5-122B-A10B-Instruct-4bit \
--prompt "/think What are the tradeoffs between sparse and dense attention?" \
--max-tokens 1000

Qwen3.5 supports explicit thinking and non-thinking modes, activated by a prompt prefix. In thinking mode (/think), the model emits an extended chain-of-thought trace before producing its answer — useful for hard reasoning, math, and complex debugging. In non-thinking mode (/nothink), it answers directly without internal deliberation.
At 61 tok/s on the M5 Max, a 5,000-token thinking chain takes roughly 82 seconds. That's reasonable for problems where quality matters more than speed — algorithm design, proof-checking, multi-step debugging. For everyday conversation, code completions, and quick lookups, /nothink keeps responses arriving in under two seconds.
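That latency estimate is just chain length divided by throughput:

```python
def thinking_latency_s(chain_tokens, toks_per_s):
    """Wall-clock seconds to emit a chain-of-thought of a given length."""
    return chain_tokens / toks_per_s

print(f"{thinking_latency_s(5000, 61):.0f} s")  # ~82 s for a 5,000-token chain
```

The same formula makes mode selection concrete: a 100-token direct answer arrives in under two seconds, so /nothink is the right default for interactive work.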
The /think and /nothink prefixes apply per message, not to the entire session. You can mix modes freely mid-conversation — /nothink for quick clarifications, /think when you hit something genuinely hard. The model's conversation context carries over between turns regardless of which mode each message uses.
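In API terms, per-message modes just mean prefixing each user message independently. A minimal sketch against Ollama's native /api/chat endpoint used earlier — the helper name is my own, and a real multi-turn payload would also carry the assistant replies between user turns:

```python
import json

def chat_payload(model, turns):
    """Build an Ollama /api/chat request body where each user turn
    selects its own mode via prefix (think or nothink)."""
    messages = [{"role": "user", "content": f"/{mode} {text}"} for mode, text in turns]
    return {"model": model, "messages": messages, "stream": False}

payload = chat_payload("qwen3.5:122b-a10b", [
    ("nothink", "What does the A10B suffix mean?"),
    ("think", "Walk through the tradeoffs of top-k expert routing."),
])
print(json.dumps(payload, indent=2))
# Write this to a file and POST it with curl, or send it with any HTTP client.
```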
The flagship Qwen3.5-397B-A17B (Multi-GPU or Ultra hardware) posts strong numbers across the standard evals: 88.4 on GPQA Diamond (graduate-level science), 91.3 on AIME 2026 (advanced math competition), 83.6 on LiveCodeBench v6 (real-world programming tasks), and 86.7 on Tau2-Bench (multi-step agentic execution). These compete with leading closed frontier models.
The 122B-A10B scores roughly 8-12 points lower across most benchmarks — still exceptional for anything that runs on a laptop without a cloud connection. For the Qwen3.6-35B-A3B variant (the latest 35B MoE release), SWE-bench Verified comes in at 73.4%, which exceeds what frontier-class cloud models were posting 18 months ago. Quality per local compute cost is at an all-time high.
An M5 Max MacBook Pro with 128 GB lists at $4,500-5,500. At 61 tok/s on a 122B model — zero API costs, no rate limits, full data privacy, offline-capable — it's the most capable portable local AI machine available in 2026. A 10-person developer team spending $20,000-30,000 annually on cloud inference finds the payback math compelling inside 12-18 months.
If M5 Max is out of reach, the M4 Max 64 GB at around $3,000-3,500 runs Qwen3.5-35B-A3B at 44 tok/s. The M4 Pro 24 GB at $1,999 handles the 14B at 48 tok/s. Every tier of Apple Silicon now has a Qwen3.5 variant that makes real use of it. The local AI ROI calculation has never been this clear-cut.
Check which Qwen3.5 variant fits your Apple Silicon chip or GPU — weights plus KV cache at your target context length.
Open the VRAM Calculator →
Find AI models that fit your exact hardware. Enter your specs and get a ranked list instantly.