Llama 4 Scout is the most capable open-weight model Meta has ever shipped — and also one of the most misunderstood. Its 109 billion parameters sound terrifying until you understand how Mixture-of-Experts works: Scout only activates 17 billion of those parameters per token during inference. That changes the hardware math completely. This guide gives you exact VRAM numbers, real setup commands, and honest performance expectations for every consumer GPU tier.
Traditional dense models — Llama 3.3 70B, Qwen2.5 72B, Mistral Large — activate every single one of their parameters for every token generated. A 70B dense model burns through 70 billion weights per forward pass, every pass. Llama 4 Scout's Mixture-of-Experts design is fundamentally different.
Scout has 109B total parameters spread across 16 expert sub-networks. A learned router examines each token and selects the most relevant subset of experts — routing to exactly 17B active parameters per token regardless of input. The remaining experts sit loaded in memory but contribute zero compute to that token. You get the knowledge breadth of a 109B-class system at the inference throughput of a ~17B model.
The MoE VRAM catch that trips people up: all expert weights must be resident in VRAM (or system RAM for CPU offload) simultaneously. You cannot selectively load only the currently-active experts at inference time — the router needs to reach any of the 16 sub-networks on every forward pass. This is why Scout's VRAM footprint is sized against its 109B total parameter count, not the 17B active ones.
Scout at FP16 full precision requires 218 GB of VRAM for the weights alone — pure data center territory. Aggressive quantization brings it into consumer range. The sizes below are for GGUF-format quants compatible with llama.cpp, Ollama, and LM Studio; budget roughly 10% on top for KV cache at an 8K context window.
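You can sanity-check any of these sizes with the same arithmetic: total parameters times effective bits per weight, divided by eight. A minimal shell sketch — the bits-per-weight figures are approximations of each quant's effective rate, not exact llama.cpp values:

# Back-of-envelope GGUF sizing: total params (B) × effective bits/weight ÷ 8 ≈ GB
est() { echo "scale=1; $1 * $2 / 8" | bc; }
est 109 16    # FP16   -> 218.0 GB
est 109 4.8   # Q4_K_M -> ~65 GB (K-quants mix bit widths; bpw is approximate)
est 109 3.5   # Q3_K_M -> ~48 GB
est 109 1.78  # IQ1_S  -> ~24 GB
# Budget roughly 10% extra for KV cache at an 8K context window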
At 16 GB, you're in CPU-offload territory for Scout. The 1.78-bit Unsloth quant clocks in at ~24 GB — it overflows a 16 GB card by 8 GB, meaning roughly one-third of model layers must route through system RAM on every token. Expect 2–4 tokens per second. That's acceptable for long batch tasks you start and walk away from, but not for real-time chat.
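If you want to try it anyway, partial offload in llama.cpp is controlled by the -ngl flag (the number of layers kept on the GPU). A sketch, assuming you've already downloaded the Unsloth IQ1_S GGUF — the filename here is illustrative:

# Keep as many layers on the 16 GB GPU as fit; the rest run from system RAM.
# Lower -ngl if you hit out-of-memory errors; -c 8192 keeps the KV cache small.
./llama-cli -m Llama-4-Scout-17B-16E-Instruct-IQ1_S.gguf -ngl 30 -c 8192 \
  -p "Summarize the following report: ..."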
On 16 GB, your smarter plays for Scout-class quality are: the Meta Llama 4 Scout API (free tier at api.llama.com), Qwen3-30B-A3B (also MoE; ~18 GB at Q4 needs only light offload on a 16 GB card, and Q3 fits fully), or Llama 3.3 70B at Q2 with mild offloading. Scout is worth saving for a GPU upgrade rather than torturing on hardware it doesn't fit.
The 1.78-bit Unsloth quant (IQ1_S variant) is the unlock for 24 GB cards. The full model fits in a single GPU's VRAM with a few gigabytes spare for KV cache. Community benchmarks show approximately 18–22 tokens per second on an RTX 4090, 14–16 tok/s on an RTX 3090, and above 25 tok/s on an RTX 5090 with its wider memory bus.
Quality at 1.78-bit is the honest question. For creative writing, summarization, and casual coding assistance, the result is genuinely impressive — competitive with many Q4 models from a year ago. For precision-demanding tasks — code that must compile first try, multi-step mathematical proofs, structured data extraction — use Q4 on a two-GPU setup or reach for the API.
This is where Scout actually sings. Q3_K_M at 48 GB fits across a dual-RTX-3090 setup (48 GB combined), though the fit is tight: a 16K context window means spilling a few layers to system RAM. An A100 80 GB comfortably loads Q4_K_M at 65 GB with room for 32K+ context. Quality at Q3_K_M or Q4_K_M is close to the model's ceiling; on most real-world tasks, Q3_K_M Scout is indistinguishable from Q8.
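With two cards, llama.cpp handles the split itself via --tensor-split. A sketch for the dual-3090 case, with an illustrative filename:

# Split layers evenly across both GPUs and serve an OpenAI-compatible endpoint
./llama-server -m Llama-4-Scout-17B-16E-Instruct-Q3_K_M.gguf \
  --tensor-split 1,1 -ngl 99 -c 16384 --port 8080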
Apple's unified memory architecture means RAM and GPU memory are the same pool. An M4 Max with 128 GB loads Scout Q4_K_M (65 GB) with 63 GB free for the OS, applications, and KV cache. Ollama runs Scout on Apple Silicon through its Metal backend with no extra configuration; Apple's MLX framework, available through mlx-lm or LM Studio's MLX engine, is the other native path. MLX throughput on an M4 Max 128 GB runs around 25–35 tok/s for prefill and 18–24 tok/s for generation, outperforming most single Nvidia consumer cards at equivalent quality.
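If you want to try the MLX path directly, mlx-lm is the simplest route. The model repo name below is an assumption — check the mlx-community organization on Hugging Face for the current Scout conversion:

# Install Apple's MLX LM tooling and run a 4-bit Scout conversion
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit \
  --prompt "Explain mixture-of-experts routing in two paragraphs." --max-tokens 512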
Ollama added official Llama 4 support shortly after Meta's release. A single command pulls and runs Scout at the library's default quantization, and explicit tags let you match the download to your available VRAM.
# Pull and run Scout at the library's default quantization
ollama run llama4:scout
# Pull a specific quantization (tag names are illustrative; check
# ollama.com/library/llama4 for the exact tags)
ollama pull llama4:scout-q4_K_M # 65 GB — for dual-GPU or A100 80GB
ollama pull llama4:scout-q3_K_M # 48 GB — for dual RTX 3090
# The 1.78-bit Unsloth quant is hosted on Hugging Face; Ollama can pull
# GGUF repos directly (repo and tag shown are illustrative)
ollama run hf.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S # ~24 GB — single RTX 3090/4090/5090
# Inspect a model's metadata, including its quantization
ollama show llama4:scout
# Run with a larger context window (Scout supports up to 10M tokens).
# Ollama sets context length via the num_ctx parameter:
ollama run llama4:scout
# then, at the >>> prompt: /set parameter num_ctx 32768

On Apple Silicon, Ollama runs Scout on the GPU through its Metal backend automatically; no extra flags or configuration needed. If you want more throughput from the same machine, benchmark the MLX path described above against it.
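Context length can also be set per request through the REST API's options field, which avoids the interactive session entirely:

# Per-request context window via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama4:scout",
  "prompt": "Summarize the following meeting notes: ...",
  "options": { "num_ctx": 32768 }
}'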
Scout processes images natively. Text and vision were trained together from the ground up with early fusion, rather than a vision adapter bolted on afterward, so the model reasons about visual content in context instead of generating surface-level descriptions. You can pass image files directly through the Ollama API.
# Send an image to Scout via the Ollama REST API
# (base64 -w0 is GNU coreutils; on macOS use: base64 -i image.png)
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:scout",
    "messages": [
      {
        "role": "user",
        "content": "What does this diagram show, and what would you improve?",
        "images": ["'$(base64 -w0 /path/to/your/image.png)'"]
      }
    ]
  }'

Image inference is more VRAM-hungry than text: a single high-resolution image can consume several extra gigabytes of KV cache during visual encoding. If you're running Scout at the 1.78-bit limit on a 24 GB card and hitting OOM errors with images, reduce input resolution to 1024×1024 or smaller before passing images in, or lower num_ctx to free up KV cache headroom.
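One quick way to downscale before sending, assuming ImageMagick is installed (sips ships with macOS, no install needed):

# Cap the longest side at 1024 px; the '>' suffix only shrinks, never enlarges
magick diagram.png -resize '1024x1024>' diagram_small.png
# macOS built-in alternative
sips -Z 1024 diagram.png --out diagram_small.png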
Meta shipped Llama 4 Maverick alongside Scout. Maverick is the heavy sibling: 400B total parameters, 128 experts (vs Scout's 16), and the same 17B active parameters per token. Maverick needs ~200 GB of VRAM at Q4, which in practice means a multi-GPU H100 server (the standard deployment is an eight-H100 node). Even at 1.78-bit, Maverick still requires ~90 GB — a minimum of two H100 80GB cards, and far beyond any consumer setup.
Scout at Q4_K_M is a 65 GB download. Pulling the wrong quantization and discovering it won't fit in your VRAM wastes hours of transfer time. Before running ollama pull, verify your exact available VRAM, system RAM for potential CPU offload, and KV cache headroom at your target context length. The Runyard VRAM Calculator handles all of this in one place.
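On Nvidia hardware, a quick pre-flight check before committing to the download:

# Report per-GPU total and currently free VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv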
Use the Runyard VRAM Calculator to find the exact Llama 4 Scout quantization that fits your GPU — with context window headroom calculated automatically.
Open the VRAM Calculator →
Find AI models that fit your exact hardware. Enter your specs and get a ranked list instantly.