P-04

Live result

8B class is your realistic next stop

These three pages intentionally keep their hero sections lighter. The full live matching flow already exists on the Runyard home page.

Likely model class

8B class
12 GB VRAM

Best next action

Open Model Radar
Use the main product
How It Works

3 inputs. Instant results.

01

Set the scenario

Choose realistic hardware, model, and context assumptions.

02

Read the result

The hero shows a working result instead of a decorative promo block.

03

Jump to Runyard home

The three product-led pages hand off to the main live experience.

Features

Everything that powers the GPU-to-model fit checker.

01

Planning-first

Built to make local-AI decisions easier to reason about.

02

Local-AI focused

Built to make local-AI decisions easier to reason about.

03

Interactive hero

Built to make local-AI decisions easier to reason about.

04

Runyard design system

Built to make local-AI decisions easier to reason about.

05

Your GPU or unified memory device

Grounded in the actual inputs and outputs this page is designed around.

06

Model shortlist

Grounded in the actual inputs and outputs this page is designed around.

07

Direct gateway to Model Radar

Grounded in the actual inputs and outputs this page is designed around.

08

Gateway handoff

Grounded in the actual inputs and outputs this page is designed around.

Spotlight

The differentiator behind the GPU-to-model fit checker.

Before

Guessing → Interactive result → Hero section works

Reading output

Raw numbers → Guided interpretation → Easier next step

Product handoff

Duplicated product → Gateway-only hero → For the 3 requested pages

Visual comparison

[Chart comparing Clarity, Fit, and Actionability]
Reading Results

How to read the output tiers.

Comfortable

<70%

Enough breathing room for normal use.

Tight

70%-95%

Should work, but overhead matters.

Borderline

95%-110%

Likely needs one tradeoff.

Too heavy

>110%

Time to step down.
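
Read as code, the four tiers above reduce to a single ratio: estimated memory need divided by available VRAM. A minimal Python sketch using the thresholds from this page; the function name and inputs are illustrative, not part of the product:

```python
def fit_tier(required_gb: float, vram_gb: float) -> str:
    """Classify fit using the tier thresholds above (share of available VRAM)."""
    ratio = required_gb / vram_gb
    if ratio < 0.70:
        return "Comfortable"  # enough breathing room for normal use
    if ratio <= 0.95:
        return "Tight"        # should work, but overhead matters
    if ratio <= 1.10:
        return "Borderline"   # likely needs one tradeoff
    return "Too heavy"        # time to step down

print(fit_tier(11.0, 12.0))  # ~92% of VRAM -> "Tight"
```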

Quick Reference

Common setups at useful defaults.

Scenario         Baseline         Result                   Notes
Starter setup    7B / Q4 / 8K     Light local target       Good first benchmark
Balanced setup   8B / Q4 / 16K    Everyday sweet spot      Works for many users
Heavier setup    14B / Q5 / 16K   Quality-focused target   Needs stronger hardware
Stretch setup    32B / Q4 / 16K   Ambitious local target   Useful upper bound

* These are approximations for planning, not a promise of exact runtime behavior.
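
The Baseline column can be turned into a rough memory figure. A back-of-envelope sketch — the 10% runtime overhead factor and the KV-cache size per 8K tokens are assumptions for illustration, not values from this page:

```python
def rough_vram_gb(params_b: float, bits_per_weight: float, ctx_tokens: int,
                  kv_gb_per_8k: float = 1.0, overhead: float = 1.10) -> float:
    """Rough VRAM estimate: quantized weights + KV cache, plus runtime overhead.

    kv_gb_per_8k and overhead are assumed defaults; real values depend on
    the model architecture and the inference backend.
    """
    weights_gb = params_b * bits_per_weight / 8  # e.g. Q4 ~ 4.5 bits/weight
    kv_gb = kv_gb_per_8k * ctx_tokens / 8192     # KV cache grows with context
    return round((weights_gb + kv_gb) * overhead, 1)

# "Balanced setup" row (8B / Q4 / 16K) under these assumptions:
print(rough_vram_gb(8, 4.5, 16384))
```

Like the table itself, this is a planning approximation; validate against what your backend actually allocates.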

Benefits

Why people use the GPU-to-model fit checker.

01

Faster decisions

It helps eliminate dead-end local AI choices before you download, benchmark, or configure too much.

02

Clearer tradeoffs

The page turns a raw estimate into something you can actually act on.

03

Cleaner handoff to Runyard

These three pages deliberately hand off to the main product instead of pretending to replace it.

FAQ

Questions people ask before using the GPU-to-model fit checker.

What is GPU-to-model fit checking?
It is the process of comparing your GPU's VRAM against a model's memory requirements to determine whether it will load and run without memory errors. Runyard Model Radar does this live for every model in the catalogue.
What happens if a model doesn't fit my GPU?
The model either fails to load with an OOM error, or falls back to CPU offloading. Offloading can reduce inference speed by 5–20× depending on how many layers overflow to RAM.
Does this apply to Apple Silicon too?
Yes. Apple Silicon uses unified memory shared between CPU and GPU. The fit logic is the same — if total allocation exceeds available unified memory, you'll see slowdowns or OOM failures.
What is the minimum comfortable VRAM headroom?
Leave 15–20% headroom as a rule of thumb. A 12 GB GPU should run models needing at most 9–10 GB. This leaves room for context growth, overhead, and background system processes.
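
The headroom rule above is easy to apply directly. A tiny sketch of the arithmetic; the function name is illustrative:

```python
def usable_vram_gb(total_gb: float, headroom: float = 0.15) -> float:
    """Apply the 15-20% headroom rule of thumb from the answer above."""
    return round(total_gb * (1 - headroom), 1)

# A 12 GB GPU targets models needing at most ~9.6-10.2 GB:
print(usable_vram_gb(12, 0.15))  # 10.2
print(usable_vram_gb(12, 0.20))  # 9.6
```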
Can I run a model that's slightly too large with offloading?
Sometimes. Partial CPU offloading works in Ollama and llama.cpp. For 1–2 GB overflow it can still be practical — the offloaded layers run on CPU, slowing output proportionally but not catastrophically.
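
The overflow case above can be sized per layer. A sketch that assumes model weights are split evenly across layers — a simplification, since real per-layer sizes vary — useful only as a starting point for llama.cpp's `--n-gpu-layers` or Ollama's `num_gpu` setting:

```python
import math

def layers_to_offload(model_gb: float, n_layers: int, vram_gb: float) -> int:
    """Estimate how many layers must spill to CPU RAM when a model overflows."""
    overflow_gb = model_gb - vram_gb
    if overflow_gb <= 0:
        return 0  # the whole model fits on the GPU
    per_layer_gb = model_gb / n_layers  # assumes an even split across layers
    return math.ceil(overflow_gb / per_layer_gb)

# A 13 GB model with 32 layers on a 12 GB card: 1 GB overflow -> 3 layers
print(layers_to_offload(13.0, 32, 12.0))
```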
Where can I find the live GPU-to-model fit check?
Runyard Model Radar is the live GPU-to-model fit checker. Select your GPU and see every model scored by fit, speed, and context. This page explains the concept — home does the live matching.

Estimates on this page are directional and should be validated against your actual runtime and hardware.

Copyright 2026 Runyard.dev. Planning estimates only; real-world runtime behavior may vary by backend and hardware.