Live result: 10.7 tok/s likely

On 12 GB VRAM, this planning model puts an 8B Q4_K_M setup at around 10.7 tokens per second.

Current estimate: 10.7 tok/s (8B class, balanced)

Smaller model feel: 16.0 tok/s, up from 10.7 tok/s
How It Works

3 inputs. Instant results.

01

Set the scenario

Choose realistic hardware, model, and context assumptions.

02

Read the result

The hero shows a working result instead of a decorative promo block.

03

Act on the outcome

Use the result to adjust fit, speed, quantization, or context.
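For readers who want to see the shape of that three-input flow, here is a minimal sketch of a bandwidth-bound estimator in Python. Everything in it is an illustrative assumption: the bits-per-weight figures, the per-class bandwidth numbers, the efficiency factor, and the function name are ballpark stand-ins, not the coefficients this page actually uses.

```python
# Minimal sketch of a bandwidth-bound decode estimate: tokens per second
# is roughly effective memory bandwidth divided by the bytes of weights
# streamed per generated token. All constants below are assumptions.

BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q8_0": 8.5}  # approx.

BANDWIDTH_GBPS = {  # assumed ballpark bandwidth per hardware class
    "8 GB GPU": 448.0,
    "12 GB GPU": 360.0,
    "16 GB GPU": 560.0,
    "CPU-only": 60.0,
}

def estimate_tok_s(params_b: float, quant: str, hardware: str,
                   efficiency: float = 0.5) -> float:
    """Estimate decode speed for a params_b-billion-parameter model."""
    weight_gb = params_b * BITS_PER_WEIGHT[quant] / 8.0
    return BANDWIDTH_GBPS[hardware] * efficiency / weight_gb

# Example: an 8B model at Q4_K_M on a 12 GB-class GPU.
print(f"~{estimate_tok_s(8, 'Q4_K_M', '12 GB GPU'):.1f} tok/s")
```

This naive ceiling comes out well above the page's more conservative 10.7 tok/s hero figure, which presumably also weighs context length, KV-cache pressure, and real-world overheads.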

Features

Everything that powers the tokens-per-second estimator.

01

Planning-first

Puts the planning estimate first, so you can reason about a setup before you benchmark it.

02

Local-AI focused

Built to make local-AI decisions easier to reason about.

03

Interactive hero

The hero is a working estimator you can use immediately, not a decorative promo block.

04

Runyard design system

Presented in the Runyard design system shared by other RUNYARD.DEV tools.

05

GPU class or CPU-only setup

Accepts either a GPU class or a CPU-only configuration as the hardware input.

06

Likely speed band

Reports a likely speed band rather than a single false-precision number.

07

Helpful before a long model download

Tells you whether a setup is worth trying before you commit to a multi-gigabyte download.

08

Standalone tool

Works as a complete tool on its own; the rest of the page explains how to read the output.

Spotlight

The differentiator behind the tokens-per-second estimator.

7B Q4 on 8 GB GPU

~18 tok/s (RTX 3070 class)

14B Q4 on 16 GB GPU

~12 tok/s (borderline interactive)

70B Q4 on 48 GB GPU

~4 tok/s (slow but functional)
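One way to sanity-check anchor numbers like these is to back-solve the effective memory bandwidth they imply. The sketch below assumes a Q4-class quant stores roughly 4.85 bits per weight and that each generated token streams the full weight set once; both assumptions are ours, while the tok/s figures are the page's.

```python
# Back-solve the effective bandwidth implied by each Spotlight estimate.
# The 4.85 bits-per-weight figure for Q4-class quants is an assumption;
# the tok/s values come from the scenarios above.

Q4_BITS = 4.85

SCENARIOS = [
    ("7B on 8 GB GPU", 7, 18),
    ("14B on 16 GB GPU", 14, 12),
    ("70B on 48 GB GPU", 70, 4),
]

for name, params_b, tok_s in SCENARIOS:
    weight_gb = params_b * Q4_BITS / 8.0
    print(f"{name}: {weight_gb:.1f} GB weights -> "
          f"~{weight_gb * tok_s:.0f} GB/s effective bandwidth")
```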

Visual comparison

[Chart comparing Clarity, Fit, and Actionability]
Reading Results

How to read the output tiers.

Tier         | Range    | Meaning
Comfortable  | <70%     | Enough breathing room for normal use.
Tight        | 70–95%   | Should work, but overhead matters.
Borderline   | 95–110%  | Likely needs one tradeoff.
Too heavy    | >110%    | Time to step down.
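As a sketch, the tier mapping reduces to a simple threshold function. We are assuming here that the percentage is the estimated memory footprint relative to available capacity; the function and its name are illustrative, not the page's code.

```python
def fit_tier(usage_ratio: float) -> str:
    """Map an estimated usage ratio (1.0 == 100%) to a tier name."""
    if usage_ratio < 0.70:
        return "Comfortable"  # enough breathing room for normal use
    if usage_ratio <= 0.95:
        return "Tight"        # should work, but overhead matters
    if usage_ratio <= 1.10:
        return "Borderline"   # likely needs one tradeoff
    return "Too heavy"        # time to step down

print(fit_tier(0.88))  # -> Tight
```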

Quick Reference

Common setups at useful defaults.

Scenario        | Baseline        | Result                  | Notes
Starter setup   | 7B / Q4 / 8K    | Light local target      | Good first benchmark
Balanced setup  | 8B / Q4 / 16K   | Everyday sweet spot     | Works for many users
Heavier setup   | 14B / Q5 / 16K  | Quality-focused target  | Needs stronger hardware
Stretch setup   | 32B / Q4 / 16K  | Ambitious local target  | Useful upper bound

* These are approximations for planning, not a promise of exact runtime behavior.
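If you want to script over the same presets, the table translates directly into data. The field names below are ours; the values mirror the table.

```python
# Quick Reference presets as data. Field names are illustrative; the
# values come from the table above.
PRESETS = {
    "starter":  {"params_b": 7,  "quant": "Q4", "context": "8K"},
    "balanced": {"params_b": 8,  "quant": "Q4", "context": "16K"},
    "heavier":  {"params_b": 14, "quant": "Q5", "context": "16K"},
    "stretch":  {"params_b": 32, "quant": "Q4", "context": "16K"},
}
```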

Benefits

Why people use the tokens-per-second estimator.

01

Faster decisions

It rules out dead-end local-AI choices before you sink time into downloading, benchmarking, or configuring.

02

Clearer tradeoffs

The page turns a raw estimate into something you can actually act on.

03

Useful on its own

The hero provides a working tool surface while the rest of the page explains what the output means.

FAQ

Questions people ask before using the tokens-per-second estimator.

How accurate are these speed estimates?
They are planning estimates, not benchmarks. Actual speed depends on backend, driver version, CUDA version, and thermals. Treat them as a directional range — useful for deciding whether a setup is worth trying.
What speed is "interactive" for a chat assistant?
Most users find 10+ tok/s comfortable for chat. Below 5–7 tok/s, generation feels laggy. For coding, 20+ tok/s is preferred. Background batch tasks can run at any speed without a perceptible impact.
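Those thresholds fit in a few lines if you want them in code. The band labels, the function name, and the treatment of the ambiguous 7–10 tok/s zone are our reading of the answer above.

```python
def chat_feel(tok_s: float) -> str:
    """Rough interactivity bands from the thresholds above (our labels)."""
    if tok_s >= 20:
        return "comfortable even for coding"
    if tok_s >= 10:
        return "comfortable for chat"
    if tok_s >= 7:
        return "borderline"
    return "laggy to batch-only"

print(chat_feel(10.7))  # -> comfortable for chat
```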
How much does quantization affect inference speed?
Lower quant is generally faster on memory-bandwidth-limited GPUs. Q4_K_M is often 1.2–1.5× faster than Q8_0 on the same GPU because it moves fewer bytes per computation cycle — bandwidth is the real constraint.
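The bytes argument is easy to make concrete. Using typical llama.cpp bits-per-weight figures (an assumption on our part), the ideal ratio lands a bit above the quoted 1.2–1.5× because dequantization overhead and compute limits eat into the pure-bandwidth win:

```python
# Ideal speed ratio from bytes alone, using typical (assumed) llama.cpp
# bits-per-weight figures. Real speedups land lower because dequant
# overhead and compute limits also matter.
Q4_K_M_BITS = 4.85
Q8_0_BITS = 8.5

print(f"Ideal Q4_K_M vs Q8_0 speedup: ~{Q8_0_BITS / Q4_K_M_BITS:.2f}x")
```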
Why is Apple Silicon sometimes faster per watt than discrete GPUs?
Apple Silicon uses unified memory, eliminating the PCIe transfer bottleneck. M-series memory bandwidth is extremely high relative to power draw, making them competitive with discrete GPUs for inference workloads.
Can CPU-only inference ever be practical?
For 7B models at Q4, modern CPUs with AVX-512 or Apple Silicon can reach 5–15 tok/s — borderline interactive. Anything larger than 13B on CPU-only becomes too slow for most day-to-day use cases.
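A back-of-the-envelope check on the CPU range quoted above, assuming decode is bound by main-memory bandwidth; the DDR5 bandwidth and efficiency figures are assumed ballparks, not measurements.

```python
# Rough CPU-only ceiling for a 7B Q4-class model. All constants are
# assumed ballparks: ~4.85 bits/weight, dual-channel DDR5 peak, and an
# achievable fraction of that peak on CPU.
WEIGHT_GB = 7 * 4.85 / 8.0   # ~4.2 GB of weights streamed per token
DDR5_GBPS = 80.0             # assumed dual-channel DDR5 peak bandwidth
EFFICIENCY = 0.5             # assumed achievable fraction on CPU

print(f"~{DDR5_GBPS * EFFICIENCY / WEIGHT_GB:.0f} tok/s ceiling")
```

That lands around 9 tok/s, consistent with the 5–15 tok/s range in the answer.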
Why do some smaller models feel faster than expected?
Smaller models have fewer layers and smaller hidden dimensions, so each token-generation step involves less computation and less memory-bandwidth pressure. The speed gain is often more than proportional to the size reduction.


Estimates on this page are directional and should be validated against your actual runtime and hardware.

Copyright 2026 Runyard.dev. Planning estimates only. Real-world runtime behavior may vary by backend and hardware.