There's a number circulating in the local AI community that stops people mid-scroll: a 122-billion-parameter model generating 60 tokens per second on battery, in a 14-inch laptop, using 73.8 GB of unified memory. That number belongs to Qwen3.5-122B-A10B on the Apple M5 Max. It's not a cherry-picked server benchmark — it's a measured result at 16K context on consumer hardware. Understanding why this is possible, and how to replicate it on your hardware, is what this post is about.
For reference: a dense 13B model on an RTX 4090 typically posts 50-70 tok/s. A 70B dense model at Q4 on the same card does around 20-30 tok/s. The relationship between model size and inference speed has always been roughly linear — more parameters means more data to process per token.
Qwen3.5-122B-A10B breaks that relationship. The "A10B" suffix means only 10 billion parameters are active on each forward pass. The remaining 112 billion sit loaded in memory but dormant for any given token. Inference throughput correlates with active parameter count, not stored size. You get the knowledge and reasoning depth of 122B worth of trained weights, generating tokens at roughly the speed of a dense 10B model.
On the M5 Max with 128 GB of unified memory and Apple's Metal-accelerated inference path, this translates to 60.6 tok/s at 16K context — or up to 65 tok/s at shorter context lengths. Fast enough to feel like real-time conversation, capable enough to compete with frontier cloud models from 18 months ago.
A standard dense transformer applies every layer to every token. Every weight participates in every forward pass. Mixture-of-Experts replaces some feed-forward layers with a collection of specialized sub-networks — "experts" — plus a learned router that selects which experts to activate for each token.
In Qwen3.5, this routing is sparse: at each MoE layer, only a small fixed number of experts fire per token. The rest are bypassed entirely — no compute, no memory bandwidth consumed for those weights during that pass. With 122B total parameters but only 10B active, roughly 92% of weights are skipped on any given token. This is why inference speed tracks active parameters, not total model size.
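The routing mechanics can be sketched in a few lines of numpy. This is a toy illustration, not Qwen's actual router: the expert count, dimensions, and top_k here are made up, and production MoE layers add load-balancing losses and fused kernels on top.

```python
import numpy as np

def moe_forward(x, router_w, experts, top_k=2):
    """Route one token through top_k of n experts; the rest do no work.

    x: (d,) token hidden state
    router_w: (n_experts, d) learned router weights (illustrative shapes)
    experts: list of callables, one per expert feed-forward network
    """
    logits = router_w @ x                    # one score per expert
    top = np.argsort(logits)[-top_k:]        # indices of the top_k experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only the selected experts execute -- the skipped ones consume
    # no compute and no memory bandwidth for this token.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Toy demo: 8 experts, only 2 fire per token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda x, W=rng.normal(size=(d, d)) / d: W @ x for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
y = moe_forward(rng.normal(size=d), router_w, experts)
print(y.shape)
```

With top_k fixed, per-token work stays constant no matter how many experts the model stores, which is exactly why 122B total parameters can decode at dense-10B speeds.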
The second architectural component is linear attention, which replaces standard softmax self-attention in a subset of layers. Standard attention's memory cost grows quadratically with sequence length: a 128K context implies roughly 128K × 128K attention scores per head, and the KV cache grows with every token. Linear attention approximates this with a fixed-size recurrent state, so the cache stays bounded no matter how long the conversation grows. Longer contexts don't blow up your memory the way they do in a standard transformer.
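A minimal sketch of the idea, assuming a simple kernelized formulation (the feature map and dimensions here are illustrative; Qwen3.5's exact linear-attention variant is not public at this level of detail):

```python
import numpy as np

def linear_attention_step(state, z, q, k, v, phi=lambda x: np.maximum(x, 0) + 1e-6):
    """One decode step of kernelized linear attention.

    state: (d_k, d_v) running sum of phi(k) v^T -- fixed size,
           independent of how many tokens came before
    z:     (d_k,)     running sum of phi(k), used as a normalizer
    """
    fk = phi(k)
    state = state + np.outer(fk, v)               # fold this token into the state
    z = z + fk
    out = (phi(q) @ state) / (phi(q) @ z + 1e-6)  # attend via the compressed state
    return out, state, z

# The state stays (d_k, d_v) no matter how many tokens we process:
d_k, d_v = 8, 8
state, z = np.zeros((d_k, d_v)), np.zeros(d_k)
rng = np.random.default_rng(1)
for _ in range(1000):  # 1000 tokens; memory footprint never grows
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    out, state, z = linear_attention_step(state, z, q, k, v)
print(state.shape)
```

The trade is exactness for boundedness: the fixed-size state is a lossy summary of the history, which is why these models interleave linear layers with standard attention layers rather than replacing attention wholesale.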
Together, sparse experts plus linear attention produce what Alibaba reports as 8-19x faster decoding versus the previous Qwen3-Max generation. The performance gain is architectural — fewer operations per token and a tighter memory footprint, not just faster hardware.
On Apple Silicon, unified memory means the entire RAM pool is available for model weights. The M5 Max with 128 GB reports 73.8 GB peak usage for Qwen3.5-122B-A10B-4bit at 16K context — leaving 54 GB for the OS, system processes, and additional context headroom. The M3/M4 Max at 96 GB fits it too, but cap your context at 32K to stay safe.
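The arithmetic behind that fit is simple back-of-envelope math. In this sketch the 10% overhead factor is an assumption covering embeddings and norm layers kept at higher precision:

```python
def weights_gb(params_b, bits=4, overhead=1.1):
    """Rough quantized-weight footprint in GB. The overhead factor is an
    assumed fudge for tensors stored above 4-bit (embeddings, norms)."""
    return params_b * 1e9 * bits / 8 / 1e9 * overhead

print(f"122B @ 4-bit: ~{weights_gb(122):.0f} GB of weights")
# The reported 73.8 GB peak adds the KV cache and runtime buffers at 16K context.
print(f"Headroom on a 128 GB machine: {128 - 73.8:.1f} GB")
```

Weights alone land in the mid-60s of GB; the gap up to the measured 73.8 GB peak is context-dependent, which is why the 96 GB machines need the tighter 32K context cap.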
The Qwen3.5 family spans from a compact 7B to the frontier-scale 397B-A17B. Here's how each variant maps to local hardware tiers:
The MoE advantage is visible throughout this chart. The 35B-A3B on a 64 GB M4 Max posts 44 tok/s — nearly as fast as the 14B dense model on the M4 Pro, despite being 2.5x larger in total parameters. At the top, the 122B-A10B on M5 Max at 61 tok/s is 2.5x faster than the dense 72B at the same memory tier. Active parameters drive throughput, not stored ones.
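Why active parameters dominate: decode on unified memory is largely bandwidth-bound, because every active weight must be streamed from RAM once per generated token. A rough ceiling can be computed as bandwidth divided by active bytes per token. The 500 GB/s bandwidth figure below is an assumed round number, not an official M5 Max spec:

```python
def decode_ceiling_toks(active_params_b, bits, mem_bw_gbs):
    """Bandwidth-bound decode ceiling: each active weight is read once per token."""
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return mem_bw_gbs * 1e9 / bytes_per_token

# Hypothetical 500 GB/s memory bandwidth:
for name, active_b in [("122B-A10B (10B active)", 10), ("dense 72B", 72)]:
    print(f"{name}: ~{decode_ceiling_toks(active_b, 4, 500):.0f} tok/s ceiling")
```

The MoE model's theoretical ceiling sits well above its measured 61 tok/s (kernel overhead and the attention layers absorb the rest), while the dense 72B's ceiling is an order of magnitude lower at the same bandwidth.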
Ollama's Metal backend handles hardware detection automatically on Apple Silicon. There's nothing platform-specific to configure — it detects unified memory, routes inference through Metal, and uses the full pool. The commands are identical to any other platform.
# Check your unified memory
system_profiler SPHardwareDataType | grep "Memory:"
# Pull and run the 14B for 24 GB+ setups
ollama pull qwen3.5:14b
ollama run qwen3.5:14b
# Pull and run the 122B MoE for M5 Max 128 GB
ollama pull qwen3.5:122b-a10b
ollama run qwen3.5:122b-a10b
# Monitor GPU activity while model is loaded
sudo powermetrics --samplers gpu_power -i 1000 -n 5
# Test via Ollama's native chat API
curl http://localhost:11434/api/chat \
-H "Content-Type: application/json" \
-d '{"model":"qwen3.5:14b","messages":[{"role":"user","content":"/nothink Explain MoE briefly."}]}'

For users who want every last tok/s, Apple's MLX framework offers an alternative inference path tuned specifically for the M-series chips. The mlx-lm package runs Qwen3.5 through Apple's own Metal compute shaders. In some configurations this yields 5-15% higher throughput versus llama.cpp — the gap narrows at Q4, where memory bandwidth dominates over compute efficiency — but it's worth testing for server or continuous workloads.
# Install MLX LM
pip install mlx-lm
# Run Qwen3.5-14B via MLX (auto-downloads 4-bit weights from Hugging Face)
mlx_lm.generate \
--model mlx-community/Qwen3.5-14B-Instruct-4bit \
--prompt "/nothink What is the difference between MoE and a dense model?" \
--max-tokens 300
# For the 122B MoE on M5 Max 128 GB
mlx_lm.generate \
--model mlx-community/Qwen3.5-122B-A10B-Instruct-4bit \
--prompt "/think What are the tradeoffs between sparse and dense attention?" \
--max-tokens 1000

Qwen3.5 supports explicit thinking and non-thinking modes, activated by a prompt prefix. In thinking mode (/think), the model emits an extended chain-of-thought trace before producing its answer — useful for hard reasoning, math, and complex debugging. In non-thinking mode (/nothink), it answers directly without internal deliberation.
At 61 tok/s on the M5 Max, a 5,000-token thinking chain takes roughly 82 seconds. That's reasonable for problems where quality matters more than speed — algorithm design, proof-checking, multi-step debugging. For everyday conversation, code completions, and quick lookups, /nothink keeps responses arriving in under two seconds.
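That latency estimate is just chain length divided by throughput:

```python
def thinking_latency_s(chain_tokens, toks_per_s):
    """Wall-clock seconds to emit a chain-of-thought of a given length."""
    return chain_tokens / toks_per_s

print(f"{thinking_latency_s(5000, 61):.0f} s")  # ~82 s for a 5,000-token chain
```

The same formula makes mode selection concrete: a 100-token direct answer arrives in under two seconds, so /nothink is the right default for interactive work.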
The /think and /nothink prefixes apply per message, not to the entire session. You can mix modes freely mid-conversation — /nothink for quick clarifications, /think when you hit something genuinely hard. The model's conversation context carries over between turns regardless of which mode each message uses.
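In API terms, per-message modes just mean prefixing each user message independently. A minimal sketch against Ollama's native /api/chat endpoint used earlier — the helper name is my own, and a real multi-turn payload would also carry the assistant replies between user turns:

```python
import json

def chat_payload(model, turns):
    """Build an Ollama /api/chat request body where each user turn
    selects its own mode via prefix (think or nothink)."""
    messages = [{"role": "user", "content": f"/{mode} {text}"} for mode, text in turns]
    return {"model": model, "messages": messages, "stream": False}

payload = chat_payload("qwen3.5:122b-a10b", [
    ("nothink", "What does the A10B suffix mean?"),
    ("think", "Walk through the tradeoffs of top-k expert routing."),
])
print(json.dumps(payload, indent=2))
# Write this to a file and POST it with curl, or send it with any HTTP client.
```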
The flagship Qwen3.5-397B-A17B (Multi-GPU or Ultra hardware) posts strong numbers across the standard evals: 88.4 on GPQA Diamond (graduate-level science), 91.3 on AIME 2026 (advanced math competition), 83.6 on LiveCodeBench v6 (real-world programming tasks), and 86.7 on Tau2-Bench (multi-step agentic execution). These compete with leading closed frontier models.
The 122B-A10B scores roughly 8-12 points lower across most benchmarks — still exceptional for anything that runs on a laptop without a cloud connection. For the Qwen3.6-35B-A3B variant (the latest 35B MoE release), SWE-bench Verified comes in at 73.4%, which exceeds what frontier-class cloud models were posting 18 months ago. Quality per local compute cost is at an all-time high.
An M5 Max MacBook Pro with 128 GB lists at $4,500-5,500. At 61 tok/s on a 122B model — zero API costs, no rate limits, full data privacy, offline-capable — it's the most capable portable local AI machine available in 2026. A 10-person developer team spending $20,000-30,000 annually on cloud inference finds the payback math compelling inside 12-18 months.
If M5 Max is out of reach, the M4 Max 64 GB at around $3,000-3,500 runs Qwen3.5-35B-A3B at 44 tok/s. The M4 Pro 24 GB at $1,999 handles the 14B at 48 tok/s. Every tier of Apple Silicon now has a Qwen3.5 variant that makes real use of it. The local AI ROI calculation has never been this clear-cut.
Check which Qwen3.5 variant fits your Apple Silicon chip or GPU — weights plus KV cache at your target context length.
Open the VRAM Calculator →
Find AI models that fit your exact hardware. Enter your specs and get a ranked list instantly.