P-08

Live result

5.59 GB estimated

Model Size to RAM / VRAM Converter puts this setup around 5.59 GB including rough runtime overhead.

Weights

4.50 GB
Q4_K_M

KV cache

0.29 GB
8K context
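The KV-cache line above can be approximated from the model architecture. A minimal sketch, assuming a Llama-3-8B-class layout (32 layers, 8 KV heads via grouped-query attention, head dimension 128); the widget's 0.29 GB figure presumably uses a different cache precision or architecture, so treat this as the shape of the calculation rather than its exact output:

```python
# Rough KV-cache size estimate. The architecture numbers used below
# (layers, KV heads, head dim) are illustrative assumptions, not values
# read out of this page's widget.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, ctx_len, bytes_per_elem=2):
    """Two tensors (K and V) per layer, one vector per token per KV head."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem
    return total_bytes / 1e9

# fp16 cache at 8K context; quantized caches (q8_0, q4_0) shrink this further.
print(round(kv_cache_gb(32, 8, 128, 8192), 2))  # → 1.07
```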
How It Works

3 inputs. Instant results.

01

Set the scenario

Choose realistic hardware, model, and context assumptions.

02

Read the result

The hero shows a working result instead of a decorative promo block.

03

Act on the outcome

Use the result to adjust fit, speed, quantization, or context.

Features

Everything that powers Model Size to RAM / VRAM Converter.

01

Planning-first

Designed for the planning stage, before you download or benchmark anything.

02

Local-AI focused

Centered on the constraints that matter when running models locally: memory, quantization, and context.

03

Interactive hero

The hero section is a working calculator rather than a static promo block.

04

Runyard design system

Built with the same design system used across Runyard.dev tools.

05

Parameter count

Takes the model's parameter count as a core sizing input.

06

Approximate memory footprint

Outputs an approximate RAM / VRAM footprint, including rough runtime overhead.

07

Good companion to model cards

Pairs well with model cards when deciding which quantization to download.

08

Standalone tool

Useful on its own, with no other tooling or setup required.

Spotlight

The differentiator behind Model Size to RAM / VRAM Converter.

7B Q4_K_M

~3.9 GB weights · ~4.4 GB total (+0.5 GB overhead)

14B Q4_K_M

~7.9 GB weights · ~8.4 GB total (+0.5 GB overhead)

70B Q4_K_M

~39.4 GB weights · ~39.9 GB total (needs a 48 GB GPU)
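The spotlight numbers above follow directly from the page's formula. A minimal sketch, assuming Q4_K_M at ~4.5 bits per weight and a flat ~0.5 GB runtime overhead:

```python
# Weights-plus-overhead arithmetic behind the 7B / 14B / 70B examples.
# The 4.5 bpw (Q4_K_M) and 0.5 GB overhead figures are this page's assumptions.
def total_gb(params_b, bpw=4.5, overhead_gb=0.5):
    weights_gb = params_b * bpw / 8
    return weights_gb, weights_gb + overhead_gb

for p in (7, 14, 70):
    w, t = total_gb(p)
    print(f"{p}B: {w:.2f} GB weights, {t:.2f} GB total")
```

At 70B the total lands just under 40 GB, which is why the spotlight flags a 48 GB GPU rather than a 40 GB one: real runtimes rarely let you use every last gigabyte.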

Visual comparison

[Chart rating the tool on Clarity, Fit, and Actionability]

Reading Results

How to read the output tiers.

Comfortable

<70%

Enough breathing room for normal use.

Tight

70–95%

Should work, but overhead matters.

Borderline

95–110%

Likely needs one tradeoff.

Too heavy

>110%

Time to step down.
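The four tiers reduce to simple threshold checks on estimated usage as a percentage of available memory. A minimal sketch using the thresholds listed above (the function name is ours):

```python
# Map estimated memory use, as a percentage of what's available,
# to the readiness tiers described above.
def fit_tier(estimated_gb, available_gb):
    pct = 100 * estimated_gb / available_gb
    if pct < 70:
        return "Comfortable"
    if pct <= 95:
        return "Tight"
    if pct <= 110:
        return "Borderline"
    return "Too heavy"

print(fit_tier(4.4, 8.0))  # 55% of an 8 GB card → Comfortable
print(fit_tier(8.4, 8.0))  # 105% → Borderline: one tradeoff needed
```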

Quick Reference

Common setups at useful defaults.

Scenario        Baseline        Result                  Notes
Starter setup   7B / Q4 / 8K    Light local target      Good first benchmark
Balanced setup  8B / Q4 / 16K   Everyday sweet spot     Works for many users
Heavier setup   14B / Q5 / 16K  Quality-focused target  Needs stronger hardware
Stretch setup   32B / Q4 / 16K  Ambitious local target  Useful upper bound

* These are approximations for planning, not a promise of exact runtime behavior.
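The table's scenarios can be re-derived with the page's own arithmetic. A rough sketch: the bits-per-weight values (4.5 for Q4, 5.5 for Q5) are common community approximations, and KV-cache growth at 16K context is not included here:

```python
# Approximate totals for the quick-reference scenarios:
# weights (params × bpw / 8) plus a flat ~0.5 GB runtime overhead.
BPW = {"Q4": 4.5, "Q5": 5.5}  # approximate bits per weight

def scenario_gb(params_b, quant):
    return params_b * BPW[quant] / 8 + 0.5

for name, params_b, quant in [
    ("Starter", 7, "Q4"),
    ("Balanced", 8, "Q4"),
    ("Heavier", 14, "Q5"),
    ("Stretch", 32, "Q4"),
]:
    print(f"{name}: ~{scenario_gb(params_b, quant):.1f} GB")
```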

Benefits

Why people use Model Size to RAM / VRAM Converter.

01

Faster decisions

It helps eliminate dead-end local AI choices before you download, benchmark, or configure too much.

02

Clearer tradeoffs

The page turns a raw estimate into something you can actually act on.

03

Useful on its own

The hero provides a working tool surface while the rest of the page explains what the output means.

FAQ

Questions people ask before using Model Size to RAM / VRAM Converter.

How do I convert model parameters to VRAM?
VRAM (GB) = (params_B × bits_per_weight) / 8. For 7B at Q4_K_M (4.5 bpw): 7 × 4.5 / 8 = 3.94 GB + ~0.5 GB overhead = ~4.4 GB total. F16 uses 16 bpw: 7 × 16 / 8 = 14 GB.
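That formula is a one-liner in code. A minimal sketch; the function name is ours and the bpw values are the approximations quoted above:

```python
# VRAM estimate for raw weights: parameters (in billions) × bits per weight / 8.
def vram_gb(params_b, bpw):
    return params_b * bpw / 8

print(vram_gb(7, 4.5))  # → 3.9375 (Q4_K_M; ~4.4 GB total with overhead)
print(vram_gb(7, 16))   # → 14.0   (F16)
```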
What is bits-per-weight (bpw)?
It measures how many bits store each model weight. F16 uses 16 bits. Q4_K_M uses ~4.5 bits. Q8_0 uses ~8.5 bits. Lower bpw means a smaller, faster model with slightly reduced precision.
Can models run on CPU RAM instead of VRAM?
Yes. The same size calculation applies — the model loads into system RAM instead. Inference runs on CPU. Speed is typically 5–20× slower than GPU inference but fully functional for smaller models.
What is the difference between model size on disk and in memory?
They are very close. A GGUF file contains the quantized weights. When loaded, the runtime adds ~0.5 GB for KV cache allocation, activations, and runtime buffers. Expect 5–15% more than the raw file size in practice.
Why do VRAM estimates vary slightly from file size?
GGUF files pack weights tightly on disk. The runtime unpacks them, allocates the KV cache proportional to context length, and reserves buffer space. Actual VRAM use is consistently a bit higher than the file size.
What is the quick rule of thumb for popular model sizes?
At Q4_K_M: 7B ≈ 5 GB, 14B ≈ 9 GB, 32B ≈ 20 GB, 70B ≈ 40 GB. These include a small runtime overhead. MoE models compute with far fewer parameters per token (DeepSeek R1 671B activates only ~37B), but the full expert weights must still fit in memory unless the runtime offloads inactive experts to slower storage.

RUNYARD.DEV / Tools / Model Size to RAM / VRAM Converter

Estimates on this page are directional and should be validated against your actual runtime and hardware.

Copyright 2026 Runyard.dev. Planning estimates only. Real-world runtime behavior may vary by backend and hardware.