Most guides tell you to run `ollama pull` and start chatting. But what actually happens between downloading a model and seeing the first token appear on screen? The answer involves memory hierarchies, operating system tricks, and a surprising truth: the programming language an inference engine is written in barely matters. Here's a plain-English breakdown of how local LLM inference really works.
When you download a model from Hugging Face, you don't get a single runnable program. You get a collection of artifact files — each serving a different role in describing and running the model.
Think of it like a recipe: you have ingredients (weights), a method (architecture config), and a translation guide (tokenizer). An inference engine is the chef that reads the recipe and actually cooks the dish. And just like different chefs have different techniques, different engines have different opinions on the best way to serve the model.
There are several popular inference engines, each written in a different language and optimised for different use cases.
Counterintuitive fact: vLLM, written mostly in Python, outperforms llama.cpp on tokens-per-second for concurrent batched requests — despite Python being slower than C++. The bottleneck is GPU kernel efficiency and memory scheduling, not the host language.
Programming language overhead in inference is largely irrelevant. What matters is how efficiently the engine schedules GPU kernels, manages the KV cache, and batches concurrent requests. A well-optimised Python wrapper around efficient CUDA kernels will beat poorly optimised C++ every time.
Inference isn't a single operation. It flows through four distinct phases: loading the model into memory, tokenizing the prompt, prefilling the context, and decoding output tokens one at a time.
Today we focus on loading — the phase that determines how fast you go from "model downloaded" to "first token generated."
The naive way to load a 15 GB model file is to read it entirely from SSD into RAM, then copy those weights into GPU VRAM. This has two problems: it temporarily doubles your memory usage (you need 30 GB to load a 15 GB model), and it's slow.
Most inference engines, llama.cpp especially, avoid this using MMAP (memory-mapped files). Instead of eagerly copying the file into RAM, the OS maps the file on disk into the process's virtual address space. Pages of weights are only read from disk when the inference engine first touches them, and if RAM pressure forces those pages out, the OS re-reads them from disk on demand.
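Here's a minimal Python sketch of the same idea (the `model.gguf` filename is just a placeholder):

```python
import mmap

# Map the file into the process's address space. Nothing is read
# from disk yet; the OS pages the data in lazily, the first time
# each region is actually touched.
with open("model.gguf", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    magic = mm[:8]  # this access faults in only the first pages

    # Under memory pressure the OS can evict these pages and will
    # transparently re-read them from disk on the next access.
    mm.close()
```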
vLLM also supports MMAP, but its load time is longer — often several minutes — because it compiles the model graph, initialises its custom CUDA kernels, and sets up scheduling infrastructure needed for efficient concurrent request handling. You pay the startup cost upfront for faster runtime throughput later.
Every inference run is a race through memory tiers. Bandwidth increases as you move up the hierarchy, and so does price per GB: an NVMe SSD reads at single-digit GB/s, system RAM at tens of GB/s, and modern GPU VRAM at around a terabyte per second.
llama.cpp is especially good at "bunk bed" loading — splitting the model across RAM and GPU VRAM. Layers that fit in VRAM run at GPU speed; overflow layers run in RAM on the CPU. You lose some speed on the CPU layers, but you can run much larger models than your GPU alone could hold.
In llama.cpp, the `--n-gpu-layers` flag controls how many transformer layers are offloaded to GPU. Set it to 99 to push everything to GPU, or tune it to fit your available VRAM while keeping the rest in RAM.
Model weights are stored in BF16 (16-bit brain float) by default. A 7B parameter model at BF16 needs about 14 GB of VRAM. Quantization compresses those weights to lower precision — 8-bit, 6-bit, 5-bit, or 4-bit — to dramatically reduce memory usage at the cost of some accuracy.
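The arithmetic behind those numbers is one line. A rough weights-only estimate (the KV cache and runtime overhead come on top):

```python
def weight_gb(params_billions: float, bits: int) -> float:
    # 1e9 params x (bits / 8) bytes per param = params_billions * bits / 8 GB
    return params_billions * bits / 8

for bits in (16, 8, 5, 4):
    print(f"7B model @ {bits}-bit: ~{weight_gb(7, bits):.1f} GB")
# 16-bit: ~14.0 GB   8-bit: ~7.0 GB   5-bit: ~4.4 GB   4-bit: ~3.5 GB
```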
Think of it like image compression: going from 4K to 1080p. Most information is preserved, but fine detail is lost. For LLM weights, most of the semantic information survives quantization surprisingly well.
RTN (Round to Nearest) is the simplest quantization approach. Take a group of weights, find the min/max value, normalize everything to that range, and round each value to the nearest representable integer at the lower precision.
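In NumPy, a toy version of RTN over one 32-weight group looks like this; it illustrates the idea, not any particular format's exact bit packing:

```python
import numpy as np

def rtn_quantize(group: np.ndarray, bits: int = 4):
    # Normalize the group to its min/max range, then round each
    # weight to the nearest of the 2^bits representable levels.
    lo, hi = group.min(), group.max()
    scale = (hi - lo) / (2**bits - 1)
    q = np.round((group - lo) / scale).astype(np.uint8)
    return q, scale, lo          # store integers plus per-group scale/offset

def rtn_dequantize(q, scale, lo):
    return q * scale + lo        # approximate reconstruction at runtime

group = np.random.randn(32).astype(np.float32)
q, scale, lo = rtn_quantize(group)
max_err = np.abs(group - rtn_dequantize(q, scale, lo)).max()
```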
K-quants (Q4_K_S, Q4_K_M, Q5_K_M etc.) add a two-level scaling hierarchy. Instead of grouping 32 weights and quantizing them together, you group 256 weights into 8 sub-groups of 32. Each sub-group has a local scale; all 8 sub-groups share a global scale.
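A toy sketch of that two-level scheme (the real K-quant bit layouts differ in packing details):

```python
import numpy as np

def kquant_sketch(block: np.ndarray, bits: int = 4):
    # 256 weights -> 8 sub-groups of 32. Each sub-group gets a local
    # scale; the 8 local scales are themselves stored as 6-bit integers
    # against one shared global scale.
    subs = block.reshape(8, 32)
    qmax = 2 ** (bits - 1) - 1                      # 7 at 4-bit
    local = np.abs(subs).max(axis=1) / qmax         # ideal per-sub-group scales
    global_scale = local.max() / 63                 # 6-bit range: 0..63
    local_q = np.clip(np.round(local / global_scale), 1, 63)
    q = np.clip(np.round(subs / (local_q * global_scale)[:, None]), -qmax, qmax)
    return q.astype(np.int8), local_q.astype(np.uint8), global_scale
```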
AWQ (Activation-aware Weight Quantization) takes a smarter approach. Before quantizing anything, it runs a calibration dataset through the model to identify "salient weights" — weights that have a large impact on output quality, identified by high activation magnitudes.
Those salient weights are then scaled up before quantization (so rounding errors affect them less), then scaled back down after. Result: the important weights survive with less error, and the unimportant ones take the rounding loss instead.
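A toy version of that trick, assuming per-input-channel scales, a fixed scaling factor, and a plain absmax inner quantizer (real AWQ searches for the scales using the calibration data):

```python
import numpy as np

def absmax_rtn(w: np.ndarray, bits: int = 4) -> np.ndarray:
    # Symmetric round-to-nearest with one scale per output row.
    s = np.abs(w).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    return np.round(w / s) * s

def awq_sketch(w, act_mag, factor=2.0, bits=4):
    # Input channels whose average activation magnitude lands in the
    # top 1% count as salient. Scale them up before quantization so
    # their relative rounding error shrinks, then fold the inverse
    # scale back out (AWQ migrates it into the preceding op instead).
    salient = act_mag >= np.percentile(act_mag, 99)
    scale = np.where(salient, factor, 1.0)          # shape: (in_features,)
    return absmax_rtn(w * scale, bits) / scale

w = np.random.randn(128, 512).astype(np.float32)    # weight matrix
act_mag = np.abs(np.random.randn(512))              # calibration statistics
w_q = awq_sketch(w, act_mag)
```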
EXL2 goes further. It also finds salient weights — but instead of scaling them, it stores different weight groups at different bit-depths. Important groups get 5–6 bits; unimportant groups get 2–3 bits. The sensitivity analysis uses the Hessian matrix (second derivative of loss with respect to weights), which identifies exactly which weights most affect the output.
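Sketching just the bit-allocation step, with random scores standing in for the Hessian-derived sensitivities:

```python
import numpy as np

def allocate_bits(sensitivity, choices=(6, 5, 4, 3, 2)):
    # Rank weight groups by sensitivity and hand the highest bit-depths
    # to the most sensitive ones. Equal buckets average out to 4 bits;
    # EXL2 itself optimizes the split against a quality target.
    order = np.argsort(-np.asarray(sensitivity))    # most sensitive first
    bits = np.empty(len(order), dtype=int)
    for depth, idx in zip(choices, np.array_split(order, len(choices))):
        bits[idx] = depth
    return bits

scores = np.random.rand(40)                         # stand-in sensitivities
print(allocate_bits(scores).mean())                 # ~4.0 bits per weight
```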
In benchmarks comparing llama-2 13B models, EXL2 achieves the highest tokens-per-second while maintaining the lowest perplexity (a measure of output quality) — beating both GGUF and AWQ at comparable compression ratios. EXL3 is a newer iteration with further improvements.
Some quantization formats are tied to specific GPU architectures. FP8 (8-bit floating point) is natively supported in NVIDIA Hopper architecture cards (H100, H200). MXFP4 (4-bit microscaling float) is supported in Blackwell chips (RTX 5090, B200). These run quantized operations directly in hardware — no software emulation.
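On NVIDIA hardware you can check what your card reports via PyTorch. Hopper is compute capability 9.0; the Blackwell values in the comment are my reading of NVIDIA's published specs:

```python
import torch

# Hopper (H100/H200) reports compute capability 9.0, which brings
# native FP8 tensor cores; Blackwell parts (B200 ~10.0, RTX 5090 ~12.0)
# add MXFP4 support on top.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability: {major}.{minor}")
print("Native FP8 support:", (major, minor) >= (9, 0))
```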
Despite EXL2's performance advantages, GGUF remains the most widely used format for local LLM inference. The reason is straightforward: memory constraints.
Most consumer GPU cards top out at 12–24 GB VRAM. Many home users don't have a dedicated GPU at all. GGUF's killer feature is llama.cpp's hybrid RAM/VRAM offloading — you can run a 70B model on a machine with 8 GB VRAM and 32 GB RAM by keeping most layers in RAM and only the hot layers on GPU. EXL2 doesn't support this hybrid mode as gracefully.
Here's a concrete example to tie it together. You download Qwen 3.5 7B in Q4_K_M GGUF format (~4.5 GB) and run it with llama.cpp on a machine with an RTX 3060 (12 GB VRAM):
```bash
# Run Qwen 3.5 7B with all layers on GPU in llama.cpp:
#   -n 512             max tokens to generate
#   --n-gpu-layers 99  offload all layers to GPU
#   --ctx-size 4096    context window size
./llama-cli \
  -m qwen3.5-7b-q4_k_m.gguf \
  -n 512 \
  --n-gpu-layers 99 \
  --ctx-size 4096 \
  -p "Explain how MMAP works in simple terms"

# Hybrid mode: 20 layers on GPU, rest in RAM (for VRAM-limited setups)
./llama-cli -m qwen3.5-7b-q4_k_m.gguf --n-gpu-layers 20 -p "..."
```

Now that you understand the pipeline, here's how to translate it into practical decisions:
The best setup is the one that fits your VRAM. Use the Runyard VRAM Calculator at www.runyard.dev/tools/vram-calculator to find which models and quantization levels fit your exact GPU, and see estimated tokens-per-second before you download anything.
Loading is only the first phase. Once the model is in memory, the real complexity begins: how prefill works for long contexts, why decoding is memory-bandwidth-bound rather than compute-bound, how KV cache management affects throughput, speculative decoding, and how schedulers batch requests to maximise GPU utilization.
Those phases each deserve their own deep dive — because there's just as much engineering depth in decoding as there is in loading. For now, understanding MMAP, the memory hierarchy, and how quantization formats preserve weight quality gives you a solid mental model for why inference behaves the way it does on your hardware.