A 16-bit Llama-3.1-8B weighs 14.96 GB. The Q4_K_M build of the exact same model is 4.57 GB — a 69% cut — and on WikiText its perplexity moves from 7.32 to 7.56, roughly a quarter of a point. That single comparison is the whole quantization debate: you give up a sliver of measurable quality and get two-thirds of your disk and VRAM back. The real question is not whether quantization works. It is where on the ladder from Q4_K_M up to Q8_0 the loss starts to matter for the work you actually do.
GGUF quantization names look cryptic but decode cleanly. The number is the target bit-width per weight: Q4 is roughly four bits, Q5 five, Q6 six, Q8 eight. The `_K` marks a k-quant, the format llama.cpp has used since 2023, which stores weights in super-blocks with a two-level scale hierarchy instead of one flat scale per block. That hierarchy is why a 4-bit k-quant holds up far better than the old Q4_0 legacy format at the same size.
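The "same size" claim can be checked with simple bit accounting. The sketch below is illustrative only: the block layouts (weights per block, scale widths) are assumptions based on my reading of llama.cpp's block structs, and `bits_per_weight` is a hypothetical helper, not a llama.cpp function.

```python
# Storage cost per weight = (quantized payload + scale/min metadata) / weights.
# All layout numbers below are assumptions about llama.cpp's block formats.

def bits_per_weight(weights, quant_bits, metadata_bits):
    """Average bits stored per weight for one block or super-block."""
    return (weights * quant_bits + metadata_bits) / weights

# Legacy Q4_0: blocks of 32 weights, one fp16 scale per block.
q4_0 = bits_per_weight(32, 4, 16)

# Q4_K: a 256-weight super-block holds 8 sub-blocks of 32 weights.
# Each sub-block gets a 6-bit scale and a 6-bit minimum; the super-block
# adds two fp16 values that rescale those 6-bit numbers (the second level).
q4_k = bits_per_weight(256, 4, 8 * (6 + 6) + 2 * 16)

# Q6_K: a 256-weight super-block with 16 sub-blocks of 16 weights,
# 8-bit sub-block scales, and one fp16 super-block scale.
q6_k = bits_per_weight(256, 6, 16 * 8 + 16)

print(q4_0, q4_k, q6_k)
```

Under these assumptions Q4_0 and Q4_K cost the same 4.5 bits per weight, but the k-quant spends its metadata budget on a scale and an offset for every 32 weights instead of a single scale, which is one plausible reading of why it holds up better at equal size.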
The trailing `_S`, `_M`, and `_L` are mixing recipes, not separate bit-widths. Q4_K_M keeps a few sensitive tensors (in llama.cpp's recipe, roughly half of the attention value and feed-forward down-projection tensors) at 6 bits while the rest sit at 4, which is why a 'four-bit' model lands closer to 4.9 effective bits per weight on this model (16 × 4.57 / 14.96 ≈ 4.89). Q4_K_S drops those exceptions to stay smaller. Q5_K_M and Q6_K apply the same idea at higher base precision. If you want the mechanics of how these weights load and run, the Runyard explainer on [how local LLM inference actually works](/blog/how-local-llm-inference-actually-works) walks through k-quants and mixed precision in detail.
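One way to see the effective bit-width is to back it out of the file sizes quoted in this article. This is a rough sketch, assuming the FP16 file is essentially all 16-bit weights and that metadata is negligible; `effective_bpw` is a hypothetical helper name.

```python
FP16_BITS = 16

# File sizes quoted in this article for Llama-3.1-8B (same unit throughout).
SIZES = {
    "FP16": 14.96,
    "Q8_0": 7.95,
    "Q6_K": 6.13,
    "Q5_K_M": 5.33,
    "Q4_K_M": 4.57,
}

def effective_bpw(quant_size, fp16_size=SIZES["FP16"]):
    """Scale 16 bits by the size ratio; the parameter count cancels out."""
    return FP16_BITS * quant_size / fp16_size

for name, size in SIZES.items():
    print(f"{name:7s} {effective_bpw(size):.2f} bits/weight")
```

Because the calculation is a ratio, it does not depend on whether the sizes are decimal or binary gigabytes. With these inputs Q4_K_M comes out a little under 4.9 bits per weight, well above a nominal 4, which is the footprint of the higher-precision tensors in the mix.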
This matters because the naming hides the real ladder. People assume Q4 to Q8 is a smooth linear trade, but the useful gradient lives between Q3 and Q5. Above Q5 you are paying for bits that barely move quality; below Q4 you fall off a cliff. The whole reason to understand the suffixes is to find that narrow band — typically Q4_K_M to Q5_K_M — where each extra gigabyte still buys measurable accuracy. Everything outside it is either wasted VRAM or false economy.
The most useful recent reference point is a 2026 unified evaluation that ran every common llama.cpp quant on Llama-3.1-8B-Instruct against an FP16 baseline, measuring WikiText perplexity alongside MMLU, GSM8K, and IFEval. Perplexity is the cleanest single signal because it is continuous and reproducible — lower is better, and the FP16 model sets the floor at 7.32.
Read top-down, the curve is flat where it counts. Q8_0 (7.33) and Q6_K (7.35) are statistically indistinguishable from full precision. Q5_K_M at 7.40 costs you 0.08 points. Q4_K_M at 7.56 costs 0.24, still small enough that blind A/B testing on ordinary prompts rarely catches it. The cliff is below four bits: Q3_K_M jumps to 7.96, a 0.64-point degradation that shows up as noticeably worse instruction-following and more frequent factual slips.
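The deltas quoted above are easy to reproduce, and expressing them as a percentage of the baseline makes the cliff easier to see. A minimal sketch using only the perplexity figures from this article; `degradation` is a hypothetical helper.

```python
FP16_PPL = 7.32  # baseline WikiText perplexity quoted in this article

PPL = {
    "Q8_0": 7.33,
    "Q6_K": 7.35,
    "Q5_K_M": 7.40,
    "Q4_K_M": 7.56,
    "Q3_K_M": 7.96,
}

def degradation(ppl, baseline=FP16_PPL):
    """Return (absolute delta, percent increase over baseline)."""
    delta = ppl - baseline
    return round(delta, 2), round(100 * delta / baseline, 1)

for name, ppl in PPL.items():
    delta, pct = degradation(ppl)
    print(f"{name:7s} +{delta:.2f} ppl  (+{pct:.1f}%)")
```

On these numbers Q4_K_M sits a bit over 3% above the FP16 baseline, while Q3_K_M is closer to 9%.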
Perplexity is what you pay; disk and VRAM are what you buy. The same eval measured exact file sizes, and the spread is dramatic because precision scales the weight bytes almost linearly. This is the number that decides whether a model fits your card at all.
Q4_K_M's 4.57 GB is the reason it became the default download on Hugging Face. Add a few hundred megabytes of KV cache and runtime overhead and an 8B model at Q4_K_M fits comfortably inside 8 GB of VRAM with room for a 4K–8K context. Q5_K_M at 5.33 GB still fits 8 GB but leaves less context headroom; Q6_K at 6.13 GB starts to crowd an 8 GB card once context grows. If you are mapping models to a specific card, the [VRAM requirements guide](/blog/how-much-vram-to-run-local-llms) and the breakdown of [what you can run on 8GB VRAM](/blog/what-llms-can-i-run-with-8gb-vram) translate these file sizes into real fit decisions.
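The fit reasoning above can be sketched as weights plus KV cache plus a fixed allowance. Several things here are assumptions rather than figures from the eval: the attention shape (32 layers, 8 KV heads, head dimension 128, from the published Llama-3.1-8B config), an FP16 KV cache, file sizes and card capacity both read as GiB, and a hypothetical 0.5 GiB `overhead_gib` for compute buffers.

```python
GIB = 2 ** 30

# Assumed Llama-3.1-8B attention shape (grouped-query attention).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

def kv_cache_gib(context_tokens, bytes_per_value=2):
    """KV cache size: K and V each store n_kv_heads * head_dim values
    per layer per token (2 bytes each for an FP16 cache)."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return context_tokens * per_token / GIB

def fits(weights_gib, context_tokens, vram_gib=8.0, overhead_gib=0.5):
    """Return (fits?, GiB needed). overhead_gib is a hypothetical allowance."""
    needed = weights_gib + kv_cache_gib(context_tokens) + overhead_gib
    return needed <= vram_gib, round(needed, 2)

for name, size in [("Q4_K_M", 4.57), ("Q5_K_M", 5.33), ("Q6_K", 6.13)]:
    for ctx in (4096, 8192, 16384):
        print(name, ctx, fits(size, ctx))
```

Under these assumptions the cache costs 128 KiB per token, so an 8K context adds about 1 GiB. Real runtimes differ in buffer sizes and may quantize the cache, so treat the output as a first-pass estimate rather than a guarantee.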
Quantization does hurt some tasks more than others, and this is the detail most size-versus-perplexity tables hide. It does not degrade every capability evenly: it tends to magnify a model's existing weak spots, and it hits structured reasoning harder than open-ended conversation. The same Llama-3.1-8B eval shows MMLU holding up well across the range (62.43% at Q4_K_M versus 63.50% at FP16), but task-specific scores wobble more once you push below Q5.
On coding specifically, independent HumanEval runs put Q4_K_M at 51.8% Pass@1 — the same score as AWQ and bitsandbytes 4-bit on the comparable model, meaning a well-made 4-bit GGUF gives up essentially nothing on code generation versus other 4-bit schemes. The practical rule: for chat, summarization, and retrieval, Q4_K_M is plenty. For agentic coding, long-chain math, or anything where a single wrong token derails a run, step up to Q5_K_M or Q6_K if the VRAM is there. If coding is your main use, the Runyard roundup of [the best local LLMs for coding in 2026](/blog/best-local-llms-for-coding-2026) pairs model choices with the right quant level.
Quantization magnifies weaknesses rather than creating new ones. If a model is already shaky at a task in FP16, expect Q4 to make it worse there first — not somewhere random.
Stop optimizing perplexity in the abstract and frame it as a fit question: what is the largest model you can run at Q5_K_M or better, given your VRAM? A 13B at Q4_K_M almost always beats an 8B at Q8_0 on the same card, because parameter count buys more capability than the last two bits of precision. Use the quant level to claw back VRAM so you can move up a size class.
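A quick size estimate shows why the 13B-at-Q4_K_M versus 8B-at-Q8_0 comparison is a fair fight on the same card. This is a sketch under stated assumptions: the bits-per-weight values are the ones implied by this article's file sizes (16 × 4.57 / 14.96 ≈ 4.89 and 16 × 7.95 / 14.96 ≈ 8.50), the 8.03B parameter count for Llama-3.1-8B is my figure, the 13B model is generic, and the same bits-per-weight is assumed to carry across model sizes, which is only approximately true.

```python
GIB = 2 ** 30

def est_size_gib(params_billion, bits_per_weight):
    """Rough weight-file size: parameters * bits per weight, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB

small_high = est_size_gib(8.03, 8.50)   # 8B-class model at Q8_0
large_low = est_size_gib(13.0, 4.89)    # hypothetical 13B model at Q4_K_M

print(f"8B  @ Q8_0   ~{small_high:.2f} GiB")
print(f"13B @ Q4_K_M ~{large_low:.2f} GiB")
```

With these inputs the larger model at Q4_K_M comes out slightly smaller than the 8B at Q8_0, so the two compete for the same VRAM budget and the choice comes down to which one is more capable.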
Apple Silicon shifts the math slightly: unified memory lets an M-series Mac load larger quants than a same-tier discrete GPU, and MLX builds offer their own quantization path with comparable quality at similar bit-widths. But the GGUF ranking holds — Q4_K_M as the default, Q5_K_M and Q6_K when memory allows. For a hardware-first view, the Runyard guide to the [best GPU for local LLMs in 2026](/blog/best-gpu-for-local-llms-2026) maps cards to the model sizes each can realistically hold.
One trap worth naming: do not chase Q8_0 by default. For an 8B model it costs nearly twice the disk of Q4_K_M (7.95 GB vs 4.57 GB) to recover 0.23 perplexity points and a single point of MMLU. That VRAM is almost always better spent on a larger model or a longer context window. Q8_0 earns its place only in narrow cases — quantization-sensitive fine-tunes, draft models for speculative decoding, or when you are establishing a quality baseline to measure other quants against.
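The diminishing returns can be made concrete by dividing the perplexity recovered at each step up the ladder by the extra gigabytes it costs. A minimal sketch using only the sizes and perplexities quoted in this article; `marginal_gains` is a hypothetical helper.

```python
# (name, file size, WikiText perplexity), smallest first, from this article.
LADDER = [
    ("Q4_K_M", 4.57, 7.56),
    ("Q5_K_M", 5.33, 7.40),
    ("Q6_K", 6.13, 7.35),
    ("Q8_0", 7.95, 7.33),
    ("FP16", 14.96, 7.32),
]

def marginal_gains(ladder):
    """For each adjacent pair, return (step, extra size, perplexity
    points recovered per extra unit of size)."""
    out = []
    for (lo, lo_size, lo_ppl), (hi, hi_size, hi_ppl) in zip(ladder, ladder[1:]):
        extra = hi_size - lo_size
        gained = lo_ppl - hi_ppl
        out.append((f"{lo}->{hi}", round(extra, 2), round(gained / extra, 3)))
    return out

for step, extra, rate in marginal_gains(LADDER):
    print(f"{step:15s} +{extra:.2f} GB  {rate:.3f} ppl per GB")
```

The rate falls at every step: roughly 0.2 points per gigabyte from Q4_K_M to Q5_K_M, then around 0.06, 0.01, and 0.001 for the steps above it.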
For most people the quant choice is just a tag. Ollama ships a sensible default (usually Q4_K_M) and exposes other levels by name. If you want a quant that does not exist yet, llama.cpp's quantize tool builds it from an FP16 GGUF in one command. Both paths are below, copy-pasteable.
```bash
# Ollama: pull a specific quantization level by tag
ollama pull llama3.1:8b                      # ships Q4_K_M by default
ollama pull llama3.1:8b-instruct-q5_K_M
ollama pull llama3.1:8b-instruct-q6_K
ollama pull llama3.1:8b-instruct-q8_0
```

```bash
# llama.cpp: build any quant yourself from an FP16 GGUF

# 1. Convert the HF model to a full-precision GGUF
python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
    --outfile llama3.1-8b-f16.gguf --outtype f16

# 2. Quantize to the level you want
./llama-quantize llama3.1-8b-f16.gguf llama3.1-8b-Q4_K_M.gguf Q4_K_M
./llama-quantize llama3.1-8b-f16.gguf llama3.1-8b-Q5_K_M.gguf Q5_K_M
./llama-quantize llama3.1-8b-f16.gguf llama3.1-8b-Q6_K.gguf Q6_K

# 3. Run it and check the bits-per-weight llama.cpp reports on load
./llama-cli -m llama3.1-8b-Q4_K_M.gguf -p "Explain k-quants in one sentence."
```

When you download from a repo like bartowski or unsloth on Hugging Face, you are getting exactly these files: the quantize step has already been run for you. The tag is the only decision.
The numbers above are for one 8B model; your card and your target model size change the answer. Runyard's [VRAM Calculator](/tools/vram-calculator) takes a model and a quant level and tells you whether it fits, how much context you can afford, and the rough tokens/sec to expect — so you can test 'Q5_K_M of a 14B on a 16 GB card' before downloading 9 GB to find out. When you are torn between two models or two quant levels, the [Runyard Compare](/compare) page puts the fit, size, and capability trade-offs side by side.
Related reading on Runyard: [How Local LLM Inference Actually Works](/blog/how-local-llm-inference-actually-works) for the quantization internals, [How Much VRAM Do You Need to Run Local LLMs?](/blog/how-much-vram-to-run-local-llms) for fit math, [What LLMs Can You Run with 8GB VRAM?](/blog/what-llms-can-i-run-with-8gb-vram) for the entry-level tier, and [Best Local LLMs for Coding in 2026](/blog/best-local-llms-for-coding-2026) when accuracy under quantization matters most.