Llama 4 Scout is the most capable open-weight model Meta has ever shipped — and also one of the most misunderstood. Its 109 billion parameters sound terrifying until you understand how Mixture-of-Experts works: Scout only activates 17 billion of those parameters per token during inference. That changes the hardware math completely. This guide gives you exact VRAM numbers, real setup commands, and honest performance expectations for every consumer GPU tier.
Traditional dense models — Llama 3.3 70B, Qwen2.5 72B, Mistral Large — activate every single one of their parameters for every token generated. A 70B dense model burns through 70 billion weights per forward pass, every pass. Llama 4 Scout's Mixture-of-Experts design is fundamentally different.
Scout has 109B total parameters spread across 16 expert sub-networks. A learned router examines each token and selects the most relevant subset of experts — routing to exactly 17B active parameters per token regardless of input. The remaining experts sit loaded in memory but contribute zero compute to that token. You get the knowledge breadth of a 109B-class system at the inference throughput of a ~17B model.
The MoE VRAM catch that trips people up: all expert weights must be resident in VRAM (or system RAM for CPU offload) simultaneously. You cannot selectively load only the currently-active experts at inference time — the router needs to reach any of the 16 sub-networks on every forward pass. This is why Scout's VRAM footprint is sized against its 109B total parameter count, not the 17B active ones.
Scout at FP16 full precision requires 218 GB of VRAM for the weights alone — pure data center territory. Aggressive quantization brings it into consumer range. The sizes below are for GGUF-format quants compatible with llama.cpp, Ollama, and LM Studio; budget roughly 10% on top for KV cache at an 8K context window.
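You can sanity-check any of these sizes with the same arithmetic: total parameters times effective bits per weight, divided by eight. A minimal shell sketch — the bits-per-weight figures are approximations of each quant's effective rate, not exact llama.cpp values:

# Back-of-envelope GGUF sizing: total params (B) × effective bits/weight ÷ 8 ≈ GB
est() { echo "scale=1; $1 * $2 / 8" | bc; }
est 109 16    # FP16   -> 218.0 GB
est 109 4.8   # Q4_K_M -> ~65 GB (K-quants mix bit widths; bpw is approximate)
est 109 3.5   # Q3_K_M -> ~48 GB
est 109 1.78  # IQ1_S  -> ~24 GB
# Budget roughly 10% extra for KV cache at an 8K context window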
At 16 GB, you're in CPU-offload territory for Scout. The 1.78-bit Unsloth quant clocks in at ~24 GB — it overflows a 16 GB card by 8 GB, meaning roughly one-third of model layers must route through system RAM on every token. Expect 2–4 tokens per second. That's acceptable for long batch tasks you start and walk away from, but not for real-time chat.
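If you want to try it anyway, partial offload in llama.cpp is controlled by the -ngl flag (the number of layers kept on the GPU). A sketch, assuming you've already downloaded the Unsloth IQ1_S GGUF — the filename here is illustrative:

# Keep as many layers on the 16 GB GPU as fit; the rest run from system RAM.
# Lower -ngl if you hit out-of-memory errors; -c 8192 keeps the KV cache small.
./llama-cli -m Llama-4-Scout-17B-16E-Instruct-IQ1_S.gguf -ngl 30 -c 8192 \
  -p "Summarize the following report: ..."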
On 16 GB, your smarter plays for Scout-class quality are: the Meta Llama 4 Scout API (free tier at api.llama.com), Qwen3-30B-A3B (also MoE; ~18 GB at Q4 needs only light offload on a 16 GB card, and Q3 fits fully), or Llama 3.3 70B at Q2 with mild offloading. Scout is worth saving for a GPU upgrade rather than torturing on hardware it doesn't fit.
The 1.78-bit Unsloth quant (IQ1_S variant) is the unlock for 24 GB cards. The full model fits in a single GPU's VRAM with a few gigabytes spare for KV cache. Community benchmarks show approximately 18–22 tokens per second on an RTX 4090, 14–16 tok/s on an RTX 3090, and above 25 tok/s on an RTX 5090 with its wider memory bus.
Quality at 1.78-bit is the honest question. For creative writing, summarization, and casual coding assistance, the result is genuinely impressive — competitive with many Q4 models from a year ago. For precision-demanding tasks — code that must compile first try, multi-step mathematical proofs, structured data extraction — use Q4 on a two-GPU setup or reach for the API.
This is where Scout actually sings. Q3_K_M at 48 GB fits across a dual-RTX-3090 setup (48 GB combined), though the fit is tight: a 16K context window means spilling a few layers to system RAM. An A100 80 GB comfortably loads Q4_K_M at 65 GB with room for 32K+ context. Quality at Q3_K_M or Q4_K_M is close to the model's ceiling; on most real-world tasks, Q3_K_M Scout is indistinguishable from Q8.
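With two cards, llama.cpp handles the split itself via --tensor-split. A sketch for the dual-3090 case, with an illustrative filename:

# Split layers evenly across both GPUs and serve an OpenAI-compatible endpoint
./llama-server -m Llama-4-Scout-17B-16E-Instruct-Q3_K_M.gguf \
  --tensor-split 1,1 -ngl 99 -c 16384 --port 8080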
Apple's unified memory architecture means RAM and GPU memory are the same pool. An M4 Max with 128 GB loads Scout Q4_K_M (65 GB) with 63 GB free for the OS, applications, and KV cache. Ollama runs Scout on Apple Silicon through its Metal backend with no extra configuration; Apple's MLX framework, available through mlx-lm or LM Studio's MLX engine, is the other native path. MLX throughput on an M4 Max 128 GB runs around 25–35 tok/s for prefill and 18–24 tok/s for generation, outperforming most single Nvidia consumer cards at equivalent quality.
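If you want to try the MLX path directly, mlx-lm is the simplest route. The model repo name below is an assumption — check the mlx-community organization on Hugging Face for the current Scout conversion:

# Install Apple's MLX LM tooling and run a 4-bit Scout conversion
pip install mlx-lm
mlx_lm.generate --model mlx-community/Llama-4-Scout-17B-16E-Instruct-4bit \
  --prompt "Explain mixture-of-experts routing in two paragraphs." --max-tokens 512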
Ollama added official Llama 4 support shortly after Meta's release. A single command pulls and runs Scout at the library's default quantization, and explicit tags let you match the download to your available VRAM.
# Pull and run Scout at the library's default quantization
ollama run llama4:scout
# Pull a specific quantization (tag names are illustrative; check
# ollama.com/library/llama4 for the exact tags)
ollama pull llama4:scout-q4_K_M # 65 GB — for dual-GPU or A100 80GB
ollama pull llama4:scout-q3_K_M # 48 GB — for dual RTX 3090
# The 1.78-bit Unsloth quant is hosted on Hugging Face; Ollama can pull
# GGUF repos directly (repo and tag shown are illustrative)
ollama run hf.co/unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S # ~24 GB — single RTX 3090/4090/5090
# Inspect a model's metadata, including its quantization
ollama show llama4:scout
# Run with a larger context window (Scout supports up to 10M tokens).
# Ollama sets context length via the num_ctx parameter:
ollama run llama4:scout
# then, at the >>> prompt: /set parameter num_ctx 32768

On Apple Silicon, Ollama runs Scout on the GPU through its Metal backend automatically; no extra flags or configuration needed. If you want more throughput from the same machine, benchmark the MLX path described above against it.
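Context length can also be set per request through the REST API's options field, which avoids the interactive session entirely:

# Per-request context window via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama4:scout",
  "prompt": "Summarize the following meeting notes: ...",
  "options": { "num_ctx": 32768 }
}'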
Scout processes images natively. Text and vision were trained together from the ground up with early fusion, rather than a vision adapter bolted on afterward, so the model reasons about visual content in context instead of generating surface-level descriptions. You can pass image files directly through the Ollama API.
# Send an image to Scout via the Ollama REST API
# (base64 -w0 is GNU coreutils; on macOS use: base64 -i image.png)
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama4:scout",
    "messages": [
      {
        "role": "user",
        "content": "What does this diagram show, and what would you improve?",
        "images": ["'$(base64 -w0 /path/to/your/image.png)'"]
      }
    ]
  }'

Image inference is more VRAM-hungry than text: a single high-resolution image can consume several extra gigabytes of KV cache during visual encoding. If you're running Scout at the 1.78-bit limit on a 24 GB card and hitting OOM errors with images, reduce input resolution to 1024×1024 or smaller before passing images in, or lower num_ctx to free up KV cache headroom.
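One quick way to downscale before sending, assuming ImageMagick is installed (sips ships with macOS, no install needed):

# Cap the longest side at 1024 px; the '>' suffix only shrinks, never enlarges
magick diagram.png -resize '1024x1024>' diagram_small.png
# macOS built-in alternative
sips -Z 1024 diagram.png --out diagram_small.png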
Meta shipped Llama 4 Maverick alongside Scout. Maverick is the heavy sibling: 400B total parameters, 128 experts (vs Scout's 16), and the same 17B active parameters per token. Maverick needs ~200 GB of VRAM at Q4, which in practice means a multi-GPU H100 server (the standard deployment is an eight-H100 node). Even at 1.78-bit, Maverick still requires ~90 GB — a minimum of two H100 80GB cards, and far beyond any consumer setup.
Scout at Q4_K_M is a 65 GB download. Pulling the wrong quantization and discovering it won't fit in your VRAM wastes hours of transfer time. Before running ollama pull, verify your exact available VRAM, system RAM for potential CPU offload, and KV cache headroom at your target context length. The Runyard VRAM Calculator handles all of this in one place.
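On Nvidia hardware, a quick pre-flight check before committing to the download:

# Report per-GPU total and currently free VRAM
nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv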
Use the Runyard VRAM Calculator to find the exact Llama 4 Scout quantization that fits your GPU — with context window headroom calculated automatically.
Open the VRAM Calculator →
Find AI models that fit your exact hardware. Enter your specs and get a ranked list instantly.