
Running Llama 4 Scout Locally: The Complete Hardware Guide

Llama 4 Scout's MoE architecture delivers 109B parameters of knowledge at 17B parameters of inference cost per token.

Llama 4 Scout is one of the most capable open-weight models Meta has ever shipped — and also one of the most misunderstood. Its 109 billion parameters sound terrifying until you understand how Mixture-of-Experts works: Scout only activates 17 billion of those parameters per token during inference. That changes the hardware math completely. This guide gives you exact VRAM numbers, real setup commands, and honest performance expectations for every consumer GPU tier.

Why 109B Parameters Doesn't Mean What You Think

Traditional dense models — Llama 3.3 70B, Qwen2.5 72B, Mistral Large — activate every single one of their parameters for every token generated. A 70B dense model burns through 70 billion weights per forward pass, every pass. Llama 4 Scout's Mixture-of-Experts design is fundamentally different.

Scout has 109B total parameters spread across 16 expert sub-networks. A learned router examines each token and selects the most relevant subset of experts — routing to exactly 17B active parameters per token regardless of input. The remaining experts sit loaded in memory but contribute zero compute to that token. You get the knowledge breadth of a 109B-class system at the inference throughput of a ~17B model.

  • Total parameters: 109B — all weights must be resident in VRAM or offloaded to system RAM
  • Active parameters per token: 17B — determines compute speed and generation latency
  • Expert count: 16 parallel expert networks per MoE layer
  • Context window: up to 10 million tokens on the Instruct variant
  • Modalities: text and images natively — no separate vision adapter required
  • License: Llama 4 Community License — free for commercial use under a 700M monthly active user threshold

The MoE VRAM catch that trips people up: all expert weights must be resident in VRAM (or system RAM for CPU offload) simultaneously. You cannot selectively load only the currently-active experts at inference time — the router needs to reach any of the 16 sub-networks on every forward pass. This is why Scout's VRAM footprint is sized against its 109B total parameter count, not the 17B active ones.
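A quick back-of-envelope check makes the numbers in the next section less mysterious. Here's a minimal sketch of the weight-only math — the bits-per-weight values are rough approximations for each GGUF quant family, and KV cache (which the figures below fold in) is ignored:

# Rough weight-only VRAM: total_params × bits_per_weight / 8
awk 'BEGIN {
  p = 109e9                                    # Scout total parameter count
  n = split("16 8 4.5 3.5 1.78", bits, " ")    # approx bits/weight per quant family
  for (i = 1; i <= n; i++)
    printf "%5s-bit weights: ~%.0f GB\n", bits[i], p * bits[i] / 8 / 1e9
}'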

VRAM Requirements by Quantization

Scout at FP16 full precision requires 218 GB of VRAM minimum — pure data center territory. Aggressive quantization brings it into consumer range. The numbers below use GGUF-format quants compatible with llama.cpp, Ollama, and LM Studio, and include approximately 10% overhead for KV cache at an 8K context window.

Llama 4 Scout — VRAM Required by Quantization Level

  Quantization            VRAM
  FP16 (full precision)   218 GB
  Q8_0                    109 GB
  Q4_K_M                   65 GB
  Q3_K_M                   48 GB
  1.78-bit (Unsloth)       24 GB
  • FP16 — 218 GB: three H100 80GB GPUs minimum; pure data center territory
  • Q8_0 — 109 GB: dual A100 80GB; best quality, and the reference point the lower quants are measured against
  • Q4_K_M — 65 GB: single A100 80GB, or a Mac with 96 GB+ unified memory; dual RTX 3090 (48 GB combined) needs ~17 GB of CPU offload
  • Q3_K_M — 48 GB: dual RTX 3090 (a snug but workable fit), or Mac M4 Max 64 GB with light offload
  • 1.78-bit (Unsloth IQ1_S) — 24 GB: single RTX 3090, 4090, or 5090; a real quality tradeoff but genuinely conversational speed
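If you're not sure which tier you land in, total free VRAM is one query away. A minimal sketch that matches nvidia-smi output against the table above (thresholds are in MB and deliberately rounded):

# Sum free VRAM across every GPU in the machine
free_mb=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits \
          | awk '{ s += $1 } END { print s }')

# Match against the quantization table above
if   [ "$free_mb" -ge 109000 ]; then echo "Q8_0 (109 GB) fits"
elif [ "$free_mb" -ge  65000 ]; then echo "Q4_K_M (65 GB) fits"
elif [ "$free_mb" -ge  48000 ]; then echo "Q3_K_M (48 GB) fits"
elif [ "$free_mb" -ge  24000 ]; then echo "1.78-bit (24 GB) fits"
else                                 echo "CPU-offload territory — see the 16 GB tier below"
fi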

Hardware Tier Breakdown

16 GB — RTX 4060 Ti / RTX 4080 Super

At 16 GB, you're in CPU-offload territory for Scout. The 1.78-bit Unsloth quant clocks in at ~24 GB — it overflows a 16 GB card by 8 GB, meaning roughly one-third of model layers must route through system RAM on every token. Expect 2–4 tokens per second. That's acceptable for long batch tasks you start and walk away from, but not for real-time chat.
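If you want to try it anyway, llama.cpp's -ngl flag controls how many layers stay on the GPU. A sketch — the GGUF filename is illustrative, and the layer count is something you'd tune down until the card stops hitting OOM:

# Keep roughly two-thirds of layers on the 16 GB card, spill the rest to system RAM
./llama-cli \
  -m Llama-4-Scout-17B-16E-Instruct-IQ1_S.gguf \
  -ngl 32 \
  -c 8192 \
  -p "Summarize the following report:"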

On 16 GB, your smarter plays for Scout-class quality are: the Meta Llama 4 Scout API (free tier at api.llama.com), Qwen3-30B-A3B at Q4 (~18 GB — also MoE, runs well on 16 GB with a couple of layers offloaded), or Llama 3.3 70B at Q2 with mild offloading. Scout is worth saving for a GPU upgrade rather than torturing on hardware it doesn't fit.

24 GB — RTX 3090 / RTX 4090 / RTX 5090

The 1.78-bit Unsloth quant (IQ1_S variant) is the unlock for 24 GB cards. The full model fits in a single GPU's VRAM with a few gigabytes spare for KV cache. Community benchmarks show approximately 18–22 tokens per second on an RTX 4090, 14–16 tok/s on an RTX 3090, and above 25 tok/s on an RTX 5090 with its wider memory bus.

Llama 4 Scout 1.78-bit — Estimated Tokens/sec by GPU

  GPU                        Throughput
  RTX 5090 32GB              ~26 tok/s
  RTX 4090 24GB              ~20 tok/s
  RTX 3090 24GB              ~15 tok/s
  RTX 4080 16GB (offload)     ~4 tok/s
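Your numbers will vary with drivers, power limits, and context length, so measure rather than trust the chart. Ollama's --verbose flag prints prompt-eval and generation rates after each response:

# The "eval rate" in the verbose output is the tok/s figure to compare against the chart
ollama run llama4:scout --verbose "Write a haiku about memory bandwidth."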

Quality at 1.78-bit is the honest question. For creative writing, summarization, and casual coding assistance, the result is genuinely impressive — competitive with many Q4 models from a year ago. For precision-demanding tasks — code that must compile first try, multi-step mathematical proofs, structured data extraction — use Q4 on a two-GPU setup or reach for the API.

48–80 GB — Dual RTX 3090 / A100 80 GB

This is where Scout actually sings. Q3_K_M at 48 GB fills a dual-RTX-3090 setup almost to the byte, and an A100 80 GB comfortably loads Q4_K_M at 65 GB with room for 32K+ context. Quality at Q3_K_M or Q4_K_M is close to the model's ceiling — on most real-world tasks, Q3_K_M Scout is indistinguishable from Q8.
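Ollama splits layers across both cards automatically. If you drive llama.cpp directly, the split is explicit — a sketch with an illustrative GGUF filename:

# Offload everything to GPU, split 50/50 across two cards
./llama-cli \
  -m Llama-4-Scout-17B-16E-Instruct-Q3_K_M.gguf \
  -ngl 999 \
  --tensor-split 1,1 \
  -c 8192 \
  -p "Explain the router in a Mixture-of-Experts transformer."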

Apple Silicon — M3 Max 96 GB / M4 Max 128 GB

Apple's unified memory architecture means RAM and GPU memory are the same pool. An M4 Max with 128 GB loads Scout Q4_K_M (65 GB) with 63 GB free for the OS, applications, and KV cache. Ollama now defaults to the MLX backend on Apple Silicon, handling Scout natively without extra configuration. MLX throughput on M4 Max 128 GB runs around 25–35 tok/s for prefill and 18–24 tok/s for generation — outperforming most single Nvidia consumer cards at equivalent quality.
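If you'd rather drive MLX directly, the mlx-lm package exposes a one-line generator. A sketch — the model repo name below is an assumption, so verify the exact 4-bit conversion on huggingface.co/mlx-community before committing to a 60+ GB download:

# Install the MLX tooling, then generate with a community 4-bit Scout conversion
# (the repo name is an assumption — check mlx-community for the real one)
pip install mlx-lm
python -m mlx_lm.generate \
  --model mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit \
  --prompt "What does unified memory change about local LLM inference?" \
  --max-tokens 256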

Running Scout with Ollama

Ollama added official Llama 4 support shortly after Meta's release. A single command pulls and runs Scout — Ollama automatically selects the best quantization for your available VRAM and routes Apple Silicon through the MLX backend.

# Pull and run Scout (Ollama picks the best quant for your VRAM)
ollama run llama4:scout

# Pull a specific quantization (exact tag names are listed at ollama.com/library/llama4)
ollama pull llama4:scout-q4_K_M   # 65 GB — for dual-GPU or A100 80GB
ollama pull llama4:scout-q3_K_M   # 48 GB — for dual RTX 3090

# Unsloth's 1.78-bit quant is pulled from Hugging Face rather than the Ollama library
ollama run hf.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S   # 24 GB — single 3090/4090/5090

# Check which quantization Ollama auto-selected
ollama show llama4:scout

# Run with a larger context window (Scout supports up to 10M tokens)
# by setting num_ctx inside the interactive session:
ollama run llama4:scout
>>> /set parameter num_ctx 32768

On Apple Silicon, Ollama routes Scout through MLX automatically — you'll see "Using MLX runner" in the log output. No extra flags or configuration needed. The MLX path delivers significantly better throughput than the llama.cpp CUDA path would on the same Apple hardware.
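Whichever backend ends up serving the model, the REST API looks the same. Context length goes in the options object per request — a minimal sketch:

# Set the context window per request via options.num_ctx
curl http://localhost:11434/api/generate -d '{
  "model": "llama4:scout",
  "prompt": "Summarize the Llama 4 Scout architecture in three sentences.",
  "stream": false,
  "options": { "num_ctx": 32768 }
}'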

Multimodal Inputs: Images Work Out of the Box

Scout processes images natively — no separate CLIP encoder or vision adapter. You can pass image files directly through the Ollama API. The model was jointly trained on text and images from the ground up, so it reasons about visual content in context rather than generating surface-level descriptions.

# Send an image to Scout via the Ollama REST API
# (base64 -w0 is GNU coreutils; on macOS use: base64 -i image.png | tr -d '\n')
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:scout",
    "messages": [
      {
        "role": "user",
        "content": "What does this diagram show, and what would you improve?",
        "images": ["'$(base64 -w0 /path/to/your/image.png)'"]
      }
    ]
  }'

Image inference is more VRAM-hungry than text — a single high-resolution image can consume several extra gigabytes of KV cache during visual encoding. If you're running Scout at the 1.78-bit limit on a 24 GB card and hitting OOM errors with images, reduce input resolution to 1024×1024 or smaller before passing images in, or lower num_ctx to free up KV cache headroom.
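The resize itself is a one-liner — with ImageMagick, or with sips if you're on macOS and want to avoid extra installs:

# Downscale so the longest side is at most 1024 px (aspect ratio preserved)
magick input.png -resize 1024x1024 resized.png

# macOS-native alternative
sips -Z 1024 input.png --out resized.png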

Scout vs Maverick: Which One Should You Actually Run?

Meta shipped Llama 4 Maverick alongside Scout. Maverick is the heavy sibling: 400B total parameters, 128 experts (vs Scout's 16), and the same 17B active parameters per token. Maverick needs ~200 GB of VRAM at Q4 — multi-GPU data center territory. Even at 1.78-bit it still requires ~100 GB, which means at least two 80 GB data center GPUs.

  • Scout (109B total, 16 experts): designed for local inference — single high-end consumer GPU at 1.78-bit, dual consumer GPU at Q4
  • Maverick (400B total, 128 experts): meaningfully stronger on hard multi-step reasoning and complex multimodal tasks
  • On everyday tasks — coding, writing, document analysis, Q&A — Scout Q4 and Maverick Q4 score within ~5% of each other on evals
  • Maverick's advantage shows on hard reasoning chains, expert-level science and math queries, and very long context tasks
  • For personal and small-team local deployments: Scout is the right call in almost every scenario

Confirm Your Numbers Before the 65 GB Download

Scout at Q4_K_M is a 65 GB download. Pulling the wrong quantization and discovering it won't fit in your VRAM wastes hours of transfer time. Before running ollama pull, verify your exact available VRAM, system RAM for potential CPU offload, and KV cache headroom at your target context length. The Runyard VRAM Calculator handles all of this in one place.
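KV cache is the line item people forget. A generic estimator — every architecture value below is a placeholder, not Scout's actual config; read the real numbers off the model card:

# KV cache ≈ 2 (K and V) × layers × kv_heads × head_dim × ctx_len × bytes per element
awk 'BEGIN {
  layers = 48; kv_heads = 8; head_dim = 128    # placeholder architecture values
  ctx = 32768; bytes = 2                       # target context, FP16 cache precision
  printf "KV cache at %d tokens: ~%.1f GB\n", ctx, 2*layers*kv_heads*head_dim*ctx*bytes/1e9
}'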

Use the Runyard VRAM Calculator to find the exact Llama 4 Scout quantization that fits your GPU — with context window headroom calculated automatically.

Open the VRAM Calculator →
