Runyard.dev — Find AI Models That Run on Your Hardware

Q4_K_M vs Q5_K_M vs Q6_K vs Q8_0: How Much Quality Do You Lose?

A 16-bit Llama-3.1-8B weighs 14.96 GB. The Q4_K_M build of the exact same model is 4.57 GB — a 69% cut — and on WikiText its perplexity moves from 7.32 to 7.56, roughly a quarter of a point. That single comparison is the whole quantization debate: you give up a sliver of measurable quality and get two-thirds of your disk and VRAM back. The real question is not whether quantization works. It is where on the ladder from Q4_K_M up to Q8_0 the loss starts to matter for the work you actually do.

Quantizing Llama-3.1-8B to Q4_K_M cuts it from 14.96 GB to 4.57 GB while perplexity rises just 0.24 points.

What Do Q4_K_M, Q5_K_M, and Q6_K Actually Mean?

GGUF quantization names look cryptic but decode cleanly. The number is the target bit-width per weight: Q4 is roughly four bits, Q5 five, Q6 six, Q8 eight. The `_K` marks a k-quant, the format llama.cpp has used since 2023, which stores weights in super-blocks with a two-level scale hierarchy instead of one flat scale per block. That hierarchy is why a 4-bit k-quant holds up far better than the old Q4_0 legacy format at the same size.
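
The two-level idea can be sketched in a few lines of Python. This is a simplified illustration, not llama.cpp's actual bit layout: the function names, the 32-weight blocks, and the symmetric rounding are assumptions chosen for readability, and real k-quants also store per-block minimums and pack everything into bytes.

```python
import random

def quantize_superblock(weights, block_size=32, weight_bits=4, scale_bits=6):
    """Toy two-level quantizer: integer weights, integer block scales, one float super-scale."""
    qmax = 2 ** (weight_bits - 1) - 1   # 7 for signed 4-bit values
    smax = 2 ** scale_bits - 1          # 63 for 6-bit block scales
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]

    # Level 1: the ideal float scale each block would like to use.
    ideal = [max(abs(w) for w in block) / qmax for block in blocks]

    # Level 2: a single float for the whole super-block, so that the
    # per-block scales can themselves be stored as small integers.
    super_scale = max(ideal) / smax
    if super_scale == 0.0:
        super_scale = 1.0  # all-zero input; any scale reproduces it

    q_scales = [max(1, round(s / super_scale)) for s in ideal]

    q_weights = []
    for block, q_scale in zip(blocks, q_scales):
        step = q_scale * super_scale
        q_weights.append([max(-qmax, min(qmax, round(w / step))) for w in block])
    return super_scale, q_scales, q_weights

def dequantize_superblock(super_scale, q_scales, q_weights):
    """Rebuild approximate float weights from the two levels of scales."""
    return [q * q_scale * super_scale
            for q_scale, block in zip(q_scales, q_weights)
            for q in block]

# Illustrative run on 256 random weights (one super-block of 8 blocks).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(256)]
super_scale, q_scales, q_weights = quantize_superblock(weights)
restored = dequantize_superblock(super_scale, q_scales, q_weights)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"block scales: {q_scales}, worst error: {worst:.4f}")
```

In this toy layout the overhead per 256 weights is eight 6-bit block scales plus one 16-bit super-scale, which is far cheaper than giving every block its own 16-bit scale. That saving is what the hierarchy buys.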

The trailing `_S`, `_M`, and `_L` are mixing recipes, not separate bit-widths. Q4_K_M keeps a few sensitive tensors (in llama.cpp's recipe, roughly half of the attention value and feed-forward down-projection tensors) at 6 bits while the rest sit at 4, which is why a 'four-bit' model lands closer to 4.8 to 4.9 effective bits per weight. Q4_K_S drops those exceptions to stay smaller. Q5_K_M and Q6_K apply the same idea at higher base precision. If you want the mechanics of how these weights load and run, the Runyard explainer on [how local LLM inference actually works](/blog/how-local-llm-inference-actually-works) walks through k-quants and mixed precision in detail.

This matters because the naming hides the real ladder. People assume Q4 to Q8 is a smooth linear trade, but the useful gradient lives between Q3 and Q5. Above Q5 you are paying for bits that barely move quality; below Q4 you fall off a cliff. The whole reason to understand the suffixes is to find that narrow band — typically Q4_K_M to Q5_K_M — where each extra gigabyte still buys measurable accuracy. Everything outside it is either wasted VRAM or false economy.

How Much Quality Do You Actually Lose?

The most useful recent reference point is a 2026 unified evaluation that ran every common llama.cpp quant on Llama-3.1-8B-Instruct against an FP16 baseline, measuring WikiText perplexity alongside MMLU, GSM8K, and IFEval. Perplexity is the cleanest single signal because it is continuous and reproducible — lower is better, and the FP16 model sets the floor at 7.32.
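
For readers who have not computed it by hand, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of the test text. A minimal sketch, assuming you already have per-token natural-log probabilities; the function name and the sample numbers are illustrative and not taken from the evaluation:

```python
import math

def perplexity(token_logprobs):
    """exp(mean negative log-likelihood); expects natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Illustrative only: a model that gives every token probability 1/8
# is as uncertain as a uniform choice among 8 options.
uniform_guess = [math.log(1 / 8)] * 100
print(perplexity(uniform_guess))
```

Read this way, a perplexity of 7.32 means the FP16 model is about as uncertain as a uniform pick among roughly seven candidates at each token.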

WikiText Perplexity by Quantization, Llama-3.1-8B (lower is better)

| Quantization | WikiText perplexity |
| --- | --- |
| FP16 (baseline) | 7.32 |
| Q8_0 | 7.33 |
| Q6_K | 7.35 |
| Q5_K_M | 7.40 |
| Q4_K_M | 7.56 |
| Q4_K_S | 7.62 |
| Q3_K_M | 7.96 |

Read top-down, the curve is flat where it counts. Q8_0 (7.33) and Q6_K (7.35) sit within 0.03 points of full precision, a gap too small to matter in practice. Q5_K_M at 7.40 costs you 0.08 points. Q4_K_M at 7.56 costs 0.24, still small enough that blind A/B testing on ordinary prompts rarely catches it. The cliff is below four bits: Q3_K_M jumps to 7.96, a 0.64-point degradation that shows up as noticeably worse instruction-following and more frequent factual slips.

File Size and VRAM: Where the Payoff Lives

Perplexity is what you pay; disk and VRAM are what you buy. The same eval measured exact file sizes, and the spread is dramatic because precision scales the weight bytes almost linearly. This is the number that decides whether a model fits your card at all.

File Size by Quantization, Llama-3.1-8B

| Quantization | File size (GB) |
| --- | --- |
| FP16 | 14.96 |
| Q8_0 | 7.95 |
| Q6_K | 6.13 |
| Q5_K_M | 5.33 |
| Q4_K_M | 4.57 |
| Q4_K_S | 4.36 |
| Q3_K_M | 3.74 |
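
Those sizes also let you back out the effective bits per weight without knowing the parameter count: the FP16 file stores 16 bits per weight, so each quant's average is roughly 16 times its size divided by the FP16 size. A quick sketch using only the figures above; it assumes the file is almost entirely weights, which ignores a small amount of metadata, and the names are illustrative:

```python
FP16_GB = 14.96

SIZES_GB = {
    "Q8_0": 7.95,
    "Q6_K": 6.13,
    "Q5_K_M": 5.33,
    "Q4_K_M": 4.57,
    "Q4_K_S": 4.36,
    "Q3_K_M": 3.74,
}

def effective_bpw(size_gb, fp16_gb=FP16_GB):
    """Average bits per weight, scaled from the 16-bit baseline file."""
    return 16.0 * size_gb / fp16_gb

for name, size_gb in SIZES_GB.items():
    print(f"{name:7s} {effective_bpw(size_gb):.2f} bits/weight")
```

Q4_K_M works out to about 4.9 bits per weight and Q3_K_M to about 4.0, which is why the labels are better read as names than as exact bit counts.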

Q4_K_M's 4.57 GB is a large part of why it became the de facto default download for GGUF models on Hugging Face. Add the KV cache (about 0.5 GB at 4K context and 1 GB at 8K for this model with a 16-bit cache) plus runtime overhead, and an 8B model at Q4_K_M fits comfortably inside 8 GB of VRAM with room for a 4K–8K context. Q5_K_M at 5.33 GB still fits 8 GB but leaves less context headroom; Q6_K at 6.13 GB starts to crowd an 8 GB card once context grows. If you are mapping models to a specific card, the [VRAM requirements guide](/blog/how-much-vram-to-run-local-llms) and the breakdown of [what you can run on 8GB VRAM](/blog/what-llms-can-i-run-with-8gb-vram) translate these file sizes into real fit decisions.
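
To put a number on the context headroom, note that the KV cache grows linearly with context length. A rough sketch, assuming Llama-3.1-8B's published architecture (32 layers, 8 key/value heads, head dimension 128) and a 16-bit cache; the function name is illustrative, and real runtimes add compute buffers on top:

```python
def kv_cache_bytes(context_tokens, n_layers=32, n_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Keys plus values for every layer and KV head at each position."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token

for ctx in (4096, 8192, 32768):
    print(f"{ctx:6d} tokens -> {kv_cache_bytes(ctx) / 2**30:.2f} GiB")
```

Under these assumptions the cache costs 128 KiB per token, so 0.5 GiB at 4K context and 1 GiB at 8K. That growth is what squeezes Q6_K on an 8 GB card.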

  • Q8_0 — 7.95 GB, 7.33 perplexity, near-lossless; needs ~10 GB VRAM with context for an 8B model.
  • Q6_K — 6.13 GB, 7.35 perplexity; the safe choice when you have a 12 GB+ card.
  • Q5_K_M — 5.33 GB, 7.40 perplexity; the quality-per-byte sweet spot.
  • Q4_K_M — 4.57 GB, 7.56 perplexity; the 8 GB default that almost everyone should start with.
  • Q3_K_M — 3.74 GB, 7.96 perplexity; only when you genuinely cannot fit Q4.
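
One way to see where the useful band sits is to divide the perplexity recovered by the extra gigabytes spent at each step up the ladder. A small sketch using only the sizes and perplexities listed above; the variable and function names are illustrative:

```python
# (name, file size in GB, WikiText perplexity), smallest quant first
LADDER = [
    ("Q3_K_M", 3.74, 7.96),
    ("Q4_K_M", 4.57, 7.56),
    ("Q5_K_M", 5.33, 7.40),
    ("Q6_K", 6.13, 7.35),
    ("Q8_0", 7.95, 7.33),
]

def gain_per_gb(ladder):
    """Perplexity points recovered per extra GB for each step up the ladder."""
    steps = []
    for (lo_name, lo_gb, lo_ppl), (hi_name, hi_gb, hi_ppl) in zip(ladder, ladder[1:]):
        gain = (lo_ppl - hi_ppl) / (hi_gb - lo_gb)
        steps.append((f"{lo_name} -> {hi_name}", gain))
    return steps

for label, gain in gain_per_gb(LADDER):
    print(f"{label:18s} {gain:.3f} perplexity points per GB")
```

The return drops at every step: roughly 0.48 points per GB going from Q3_K_M to Q4_K_M, 0.21 from Q4_K_M to Q5_K_M, 0.06 from Q5_K_M to Q6_K, and 0.01 from Q6_K to Q8_0.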

Does Quantization Hurt Coding and Math More Than Chat?

Yes, and this is the detail most size-versus-perplexity tables hide. Quantization does not degrade every capability evenly — it tends to magnify a model's existing weak spots, and it hits structured reasoning harder than open-ended conversation. The same Llama-3.1-8B eval shows MMLU holding up well across the range (62.43% at Q4_K_M versus 63.50% at FP16), but task-specific scores wobble more once you push below Q5.

MMLU Accuracy by Quantization, Llama-3.1-8B (higher is better)

| Quantization | MMLU accuracy |
| --- | --- |
| FP16 | 63.50% |
| Q8_0 | 63.43% |
| Q6_K | 63.17% |
| Q5_K_M | 62.80% |
| Q4_K_M | 62.43% |
| Q4_K_S | 62.06% |
| Q3_K_M | 62.01% |

On coding specifically, independent HumanEval runs put Q4_K_M at 51.8% Pass@1 — the same score as AWQ and bitsandbytes 4-bit on the comparable model, meaning a well-made 4-bit GGUF gives up essentially nothing on code generation versus other 4-bit schemes. The practical rule: for chat, summarization, and retrieval, Q4_K_M is plenty. For agentic coding, long-chain math, or anything where a single wrong token derails a run, step up to Q5_K_M or Q6_K if the VRAM is there. If coding is your main use, the Runyard roundup of [the best local LLMs for coding in 2026](/blog/best-local-llms-for-coding-2026) pairs model choices with the right quant level.

Quantization magnifies weaknesses rather than creating new ones. If a model is already shaky at a task in FP16, expect Q4 to make it worse there first — not somewhere random.

Which Quantization Should You Pick?

Stop optimizing perplexity in the abstract and frame it as a fit question: what is the largest model you can run at Q4_K_M or better, given your VRAM? A 13B at Q4_K_M almost always beats an 8B at Q8_0 on the same card, because parameter count buys more capability than the last few bits of precision. Use the quant level to claw back VRAM so you can move up a size class.

  • 8 GB VRAM (RTX 4060, 3070): run 7–8B at Q4_K_M, or a 7B at Q5_K_M if you keep context modest.
  • 12 GB VRAM (RTX 4070, 3060 12GB): run 8B at Q6_K, or step up to a 13–14B at Q4_K_M.
  • 16 GB VRAM (RTX 4080, 4060 Ti 16GB): run 13–14B at Q5_K_M, or a 7B at Q8_0 for max fidelity.
  • 24 GB VRAM (RTX 4090, 3090): run a 32B at Q4_K_M, or a 14B at Q6_K with long context.
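
The tiers above follow from simple arithmetic: weights take roughly parameters times bits per weight divided by eight, and whatever is left over has to cover the KV cache and runtime buffers. A back-of-the-envelope sketch. The bits-per-weight values are approximations derived from the 8B file sizes in this article, and the 1.5 GB reserve for cache and overhead is an assumed placeholder, not a measured figure:

```python
# Approximate effective bits per weight (assumed, rounded from the 8B file sizes).
APPROX_BPW = {"Q4_K_M": 4.9, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def weights_gb(params_billion, quant):
    """Approximate weight size in GB: parameters * bits-per-weight / 8."""
    return params_billion * APPROX_BPW[quant] / 8

def fits(params_billion, quant, vram_gb, reserve_gb=1.5):
    """True if the weights plus an assumed reserve fit in the given VRAM."""
    return weights_gb(params_billion, quant) + reserve_gb <= vram_gb

candidates = [(8, "Q4_K_M", 8), (13, "Q4_K_M", 12), (8, "Q8_0", 12), (32, "Q4_K_M", 24)]
for params, quant, vram in candidates:
    size = weights_gb(params, quant)
    print(f"{params}B {quant} on {vram} GB: {size:.1f} GB weights, fits={fits(params, quant, vram)}")
```

On a 12 GB card both a 13B at Q4_K_M and an 8B at Q8_0 pass this check, which is exactly the situation where the larger model is usually the better use of the memory.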

Apple Silicon shifts the math slightly: unified memory lets an M-series Mac load larger quants than a same-tier discrete GPU, and MLX builds offer their own quantization path with comparable quality at similar bit-widths. But the GGUF ranking holds — Q4_K_M as the default, Q5_K_M and Q6_K when memory allows. For a hardware-first view, the Runyard guide to the [best GPU for local LLMs in 2026](/blog/best-gpu-for-local-llms-2026) maps cards to the model sizes each can realistically hold.

One trap worth naming: do not chase Q8_0 by default. For an 8B model it costs nearly twice the disk of Q4_K_M (7.95 GB vs 4.57 GB) to recover 0.23 perplexity points and a single point of MMLU. That VRAM is almost always better spent on a larger model or a longer context window. Q8_0 earns its place only in narrow cases — quantization-sensitive fine-tunes, draft models for speculative decoding, or when you are establishing a quality baseline to measure other quants against.

How to Get and Build Each Quant Level

For most people the quant choice is just a tag. Ollama ships a sensible default (usually Q4_K_M) and exposes other levels by name. If you want a quant that does not exist yet, llama.cpp's quantize tool builds it from an FP16 GGUF in one command. Both paths are below, copy-pasteable.

```bash
# Ollama: pull a specific quantization level by tag
ollama pull llama3.1:8b              # ships Q4_K_M by default
ollama pull llama3.1:8b-instruct-q5_K_M
ollama pull llama3.1:8b-instruct-q6_K
ollama pull llama3.1:8b-instruct-q8_0

# llama.cpp: build any quant yourself from an FP16 GGUF
# 1. Convert the HF model to a full-precision GGUF
python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
  --outfile llama3.1-8b-f16.gguf --outtype f16

# 2. Quantize to the level you want
./llama-quantize llama3.1-8b-f16.gguf llama3.1-8b-Q4_K_M.gguf Q4_K_M
./llama-quantize llama3.1-8b-f16.gguf llama3.1-8b-Q5_K_M.gguf Q5_K_M
./llama-quantize llama3.1-8b-f16.gguf llama3.1-8b-Q6_K.gguf   Q6_K

# 3. Run it and check the bits-per-weight llama.cpp reports on load
./llama-cli -m llama3.1-8b-Q4_K_M.gguf -p "Explain k-quants in one sentence."
```

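When you download from a repo like bartowski or unsloth on Hugging Face, you are getting the same kind of file: the quantize step has already been run for you, often with an importance matrix for calibration, so the bytes will not necessarily match a plain local build. The tag is the only decision.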
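
If you want to sanity-check a downloaded file before loading it, the GGUF header is easy to read directly. A minimal sketch, assuming the little-endian GGUF layout used by versions 2 and later (a 4-byte `GGUF` magic, a 32-bit version, then 64-bit tensor and metadata counts); the function name is illustrative, and the sketch does not parse the metadata itself:

```python
import struct

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file header."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:24])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("llama3.1-8b-Q4_K_M.gguf")` on a file built with the commands above returns the format version and the two counts. The quantization type itself lives in the metadata, which llama.cpp prints when it loads the model.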

Find the Right Quant for Your Hardware on Runyard

The numbers above are for one 8B model; your card and your target model size change the answer. Runyard's [VRAM Calculator](/tools/vram-calculator) takes a model and a quant level and tells you whether it fits, how much context you can afford, and the rough tokens/sec to expect — so you can test 'Q5_K_M of a 14B on a 16 GB card' before downloading 9 GB to find out. When you are torn between two models or two quant levels, the [Runyard Compare](/compare) page puts the fit, size, and capability trade-offs side by side.

Related reading on Runyard: [How Local LLM Inference Actually Works](/blog/how-local-llm-inference-actually-works) for the quantization internals, [How Much VRAM Do You Need to Run Local LLMs?](/blog/how-much-vram-to-run-local-llms) for fit math, [What LLMs Can You Run with 8GB VRAM?](/blog/what-llms-can-i-run-with-8gb-vram) for the entry-level tier, and [Best Local LLMs for Coding in 2026](/blog/best-local-llms-for-coding-2026) when accuracy under quantization matters most.
