Most guides tell you to run `ollama pull` and start chatting. But what actually happens between downloading a model and seeing the first token appear on screen? The answer involves memory hierarchies, operating system tricks, and a surprising truth: the programming language an inference engine is written in barely matters. Here's a plain-English breakdown of how local LLM inference really works.
When you download a model from Hugging Face, you don't get a single runnable program. You get a collection of artifact files — each serving a different role in describing and running the model.
Think of it like a recipe: you have ingredients (weights), a method (architecture config), and a translation guide (tokenizer). An inference engine is the chef that reads the recipe and actually cooks the dish. And just like different chefs have different techniques, different engines have different opinions on the best way to serve the model.
There are several popular inference engines, each written in a different language and optimised for different use cases.
Counterintuitive fact: vLLM, written mostly in Python, outperforms llama.cpp on tokens-per-second for concurrent batched requests — despite Python being slower than C++. The bottleneck is GPU kernel efficiency and memory scheduling, not the host language.
Programming language overhead in inference is largely irrelevant. What matters is how efficiently the engine schedules GPU kernels, manages the KV cache, and batches concurrent requests. A well-optimised Python wrapper around efficient CUDA kernels will beat poorly optimised C++ every time.
Inference isn't a single operation. It flows through four distinct phases: loading the model into memory, tokenizing the prompt, prefilling the context, and decoding output tokens one at a time.
Today we focus on loading — the phase that determines how fast you go from "model downloaded" to "first token generated."
The naive way to load a 15 GB model file is to read it entirely from SSD into RAM, then copy those weights into GPU VRAM. This has two problems: it temporarily doubles your memory usage (you need 30 GB to load a 15 GB model), and it's slow.
Most inference engines, llama.cpp especially, avoid this using MMAP (memory-mapped files). Instead of eagerly copying the file into RAM, the OS maps the file on disk into the process's virtual address space. Pages of weights are only read from disk when the inference engine first touches them, and if RAM pressure forces those pages out, the OS re-reads them from disk on demand.
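Here's a minimal Python sketch of the same idea (the `model.gguf` filename is just a placeholder):

```python
import mmap

# Map the file into the process's address space. Nothing is read
# from disk yet; the OS pages the data in lazily, the first time
# each region is actually touched.
with open("model.gguf", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    magic = mm[:8]  # this access faults in only the first pages

    # Under memory pressure the OS can evict these pages and will
    # transparently re-read them from disk on the next access.
    mm.close()
```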
vLLM also supports MMAP, but its load time is longer — often several minutes — because it compiles the model graph, initialises its custom CUDA kernels, and sets up scheduling infrastructure needed for efficient concurrent request handling. You pay the startup cost upfront for faster runtime throughput later.
Every inference run is a race through memory tiers. Bandwidth increases as you move up the hierarchy, and so does price per GB: an NVMe SSD reads at single-digit GB/s, system RAM at tens of GB/s, and modern GPU VRAM at around a terabyte per second.
llama.cpp is especially good at "bunk bed" loading — splitting the model across RAM and GPU VRAM. Layers that fit in VRAM run at GPU speed; overflow layers run in RAM on the CPU. You lose some speed on the CPU layers, but you can run much larger models than your GPU alone could hold.
In llama.cpp, the `--n-gpu-layers` flag controls how many transformer layers are offloaded to GPU. Set it to 99 to push everything to GPU, or tune it to fit your available VRAM while keeping the rest in RAM.
Model weights are stored in BF16 (16-bit brain float) by default. A 7B parameter model at BF16 needs about 14 GB of VRAM. Quantization compresses those weights to lower precision — 8-bit, 6-bit, 5-bit, or 4-bit — to dramatically reduce memory usage at the cost of some accuracy.
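The arithmetic behind those numbers is one line. A rough weights-only estimate (the KV cache and runtime overhead come on top):

```python
def weight_gb(params_billions: float, bits: int) -> float:
    # 1e9 params x (bits / 8) bytes per param = params_billions * bits / 8 GB
    return params_billions * bits / 8

for bits in (16, 8, 5, 4):
    print(f"7B model @ {bits}-bit: ~{weight_gb(7, bits):.1f} GB")
# 16-bit: ~14.0 GB   8-bit: ~7.0 GB   5-bit: ~4.4 GB   4-bit: ~3.5 GB
```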
Think of it like image compression: going from 4K to 1080p. Most information is preserved, but fine detail is lost. For LLM weights, most of the semantic information survives quantization surprisingly well.
RTN (Round to Nearest) is the simplest quantization approach. Take a group of weights, find the min/max value, normalize everything to that range, and round each value to the nearest representable integer at the lower precision.
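In NumPy, a toy version of RTN over one 32-weight group looks like this; it illustrates the idea, not any particular format's exact bit packing:

```python
import numpy as np

def rtn_quantize(group: np.ndarray, bits: int = 4):
    # Normalize the group to its min/max range, then round each
    # weight to the nearest of the 2^bits representable levels.
    lo, hi = group.min(), group.max()
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((group - lo) / scale).astype(np.uint8)
    return q, scale, lo          # store integers plus per-group scale/offset

def rtn_dequantize(q, scale, lo):
    return q * scale + lo        # approximate reconstruction at runtime

group = np.random.randn(32).astype(np.float32)
q, scale, lo = rtn_quantize(group)
max_err = np.abs(group - rtn_dequantize(q, scale, lo)).max()
```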
K-quants (Q4_K_S, Q4_K_M, Q5_K_M etc.) add a two-level scaling hierarchy. Instead of grouping 32 weights and quantizing them together, you group 256 weights into 8 sub-groups of 32. Each sub-group has a local scale; all 8 sub-groups share a global scale.
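A toy sketch of that two-level scheme (the real K-quant bit layouts differ in packing details):

```python
import numpy as np

def kquant_sketch(block: np.ndarray, bits: int = 4):
    # 256 weights -> 8 sub-groups of 32. Each sub-group gets a local
    # scale; the 8 local scales are themselves stored as 6-bit integers
    # against one shared global scale.
    subs = block.reshape(8, 32)
    qmax = 2 ** (bits - 1) - 1                      # 7 at 4-bit
    local = np.abs(subs).max(axis=1) / qmax         # ideal per-sub-group scales
    global_scale = local.max() / 63                 # 6-bit range: 0..63
    local_q = np.clip(np.round(local / global_scale), 1, 63)
    q = np.clip(np.round(subs / (local_q * global_scale)[:, None]), -qmax, qmax)
    return q.astype(np.int8), local_q.astype(np.uint8), global_scale
```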
AWQ (Activation-aware Weight Quantization) takes a smarter approach. Before quantizing anything, it runs a calibration dataset through the model to identify "salient weights" — weights that have a large impact on output quality, identified by high activation magnitudes.
Those salient weights are then scaled up before quantization (so rounding errors affect them less), then scaled back down after. Result: the important weights survive with less error, and the unimportant ones take the rounding loss instead.
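A toy version of that trick, assuming per-input-channel scales, a fixed scaling factor, and a plain absmax inner quantizer (real AWQ searches for the scales using the calibration data):

```python
import numpy as np

def absmax_rtn(w: np.ndarray, bits: int = 4) -> np.ndarray:
    # Symmetric round-to-nearest with one scale per output row.
    s = np.abs(w).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(w / s) * s

def awq_sketch(w, act_mag, factor=2.0, bits=4):
    # Input channels whose average activation magnitude lands in the
    # top 1% count as salient. Scale them up before quantization so
    # their relative rounding error shrinks, then fold the inverse
    # scale back out (AWQ migrates it into the preceding op instead).
    salient = act_mag >= np.percentile(act_mag, 99)
    scale = np.where(salient, factor, 1.0)          # shape: (in_features,)
    return absmax_rtn(w * scale, bits) / scale

w = np.random.randn(128, 512).astype(np.float32)    # weight matrix
act_mag = np.abs(np.random.randn(512))              # calibration statistics
w_q = awq_sketch(w, act_mag)
```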
EXL2 goes further. It also finds salient weights — but instead of scaling them, it stores different weight groups at different bit-depths. Important groups get 5–6 bits; unimportant groups get 2–3 bits. The sensitivity analysis uses the Hessian matrix (second derivative of loss with respect to weights), which identifies exactly which weights most affect the output.
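Sketching just the bit-allocation step, with random scores standing in for the Hessian-derived sensitivities:

```python
import numpy as np

def allocate_bits(sensitivity, choices=(6, 5, 4, 3, 2)):
    # Rank weight groups by sensitivity and hand the highest bit-depths
    # to the most sensitive ones. Equal buckets average out to 4 bits;
    # EXL2 itself optimizes the split against a quality target.
    order = np.argsort(-np.asarray(sensitivity))    # most sensitive first
    bits = np.empty(len(order), dtype=int)
    for depth, idx in zip(choices, np.array_split(order, len(choices))):
        bits[idx] = depth
    return bits

scores = np.random.rand(40)                         # stand-in sensitivities
print(allocate_bits(scores).mean())                 # ~4.0 bits per weight
```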
In benchmarks comparing llama-2 13B models, EXL2 achieves the highest tokens-per-second while maintaining the lowest perplexity (a measure of output quality) — beating both GGUF and AWQ at comparable compression ratios. EXL3 is a newer iteration with further improvements.
Some quantization formats are tied to specific GPU architectures. FP8 (8-bit floating point) is natively supported in NVIDIA Hopper architecture cards (H100, H200). MXFP4 (4-bit microscaling float) is supported in Blackwell chips (RTX 5090, B200). These run quantized operations directly in hardware — no software emulation.
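On NVIDIA hardware you can check what your card reports via PyTorch. Hopper is compute capability 9.0; the Blackwell values in the comment are my reading of NVIDIA's published specs:

```python
import torch

# Hopper (H100/H200) reports compute capability 9.0, which brings
# native FP8 tensor cores; Blackwell parts (B200 ~10.0, RTX 5090 ~12.0)
# add MXFP4 support on top.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Native FP8 support:", (major, minor) >= (9, 0))
```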
Despite EXL2's performance advantages, GGUF remains the most widely used format for local LLM inference. The reason is straightforward: memory constraints.
Most consumer GPU cards top out at 12–24 GB VRAM. Many home users don't have a dedicated GPU at all. GGUF's killer feature is llama.cpp's hybrid RAM/VRAM offloading — you can run a 70B model on a machine with 8 GB VRAM and 32 GB RAM by keeping most layers in RAM and only the hot layers on GPU. EXL2 doesn't support this hybrid mode as gracefully.
Here's a concrete example to tie it together. You download Qwen 3.5 7B in Q4_K_M GGUF format (~4.5 GB) and run it with llama.cpp on a machine with an RTX 3060 (12 GB VRAM):
```bash
# Run Qwen 3.5 7B with all layers on GPU in llama.cpp:
#   -n 512             max tokens to generate
#   --n-gpu-layers 99  offload all layers to GPU
#   --ctx-size 4096    context window size
./llama-cli \
  -m qwen3.5-7b-q4_k_m.gguf \
  -n 512 \
  --n-gpu-layers 99 \
  --ctx-size 4096 \
  -p "Explain how MMAP works in simple terms"

# Hybrid mode: 20 layers on GPU, rest in RAM (for VRAM-limited setups)
./llama-cli -m qwen3.5-7b-q4_k_m.gguf --n-gpu-layers 20 -p "..."
```

Now that you understand the pipeline, here's how to translate it into practical decisions:
The best setup is the one that fits your VRAM. Use the Runyard VRAM Calculator at www.runyard.dev/tools/vram-calculator to find which models and quantization levels fit your exact GPU, and see estimated tokens-per-second before you download anything.
Loading is only the first phase. Once the model is in memory, the real complexity begins: how prefill works for long contexts, why decoding is memory-bandwidth-bound rather than compute-bound, how KV cache management affects throughput, speculative decoding, and how schedulers batch requests to maximise GPU utilization.
Those phases each deserve their own deep dive — because there's just as much engineering depth in decoding as there is in loading. For now, understanding MMAP, the memory hierarchy, and how quantization formats preserve weight quality gives you a solid mental model for why inference behaves the way it does on your hardware.