Methodology

How Runyard benchmarks local LLMs.

Every number you see on Runyard — VRAM required, tokens-per-second estimate, recommended quantization — comes from a documented test or a documented extrapolation. This page is the long version. If something here looks wrong, email hello@runyard.dev and we will fix it.

What we test on

We maintain a fixed reference hardware set so every result is comparable over time. Where we lack a card, we estimate from architecturally similar hardware and mark the result as an estimate.

  • NVIDIA consumer: RTX 3060 12 GB, RTX 4060 Ti 16 GB, RTX 4070 Ti Super 16 GB, RTX 4090 24 GB, RTX 5090 32 GB.
  • NVIDIA workstation: RTX A6000 48 GB, RTX 6000 Ada 48 GB.
  • Apple Silicon: M2 Pro 32 GB, M3 Max 64 GB, M4 Max 128 GB.
  • AMD: RX 7900 XTX 24 GB (ROCm).
  • CPU only: Ryzen 7 7700X 64 GB DDR5, Intel Core Ultra 9 285K.

What models we test

We prioritize open-weight models that are actually downloaded by local-AI users in meaningful numbers. The current reference set:

  • Llama 3.1 8B, Llama 3.1 70B, Llama 3.3 70B
  • Mistral 7B, Mixtral 8x7B, Mixtral 8x22B, Mistral Small 3
  • Qwen 2.5 7B, Qwen 2.5 14B, Qwen 2.5 32B, Qwen 2.5 72B, Qwen 3 family
  • Gemma 2 9B, Gemma 2 27B, Gemma 3
  • Phi-3 Mini, Phi-3 Medium, Phi-4
  • DeepSeek-V3, DeepSeek-R1 (where loadable)
  • CodeLlama 13B, Qwen2.5-Coder 7B/14B/32B

Inference engines

We test each model on the engine most representative of how local users actually run it. We do not cherry-pick the fastest engine for headline numbers — we report the engine and its commit hash next to every result.

  • GGUF on NVIDIA / CPU / Apple: llama.cpp (latest tagged release).
  • EXL2 / GPU-only: ExLlamaV2 via tabbyAPI.
  • AWQ / GPTQ at production load: vLLM (latest stable).
  • Apple Silicon native: MLX (latest stable).
  • User-facing wrappers: we cross-check headline numbers against Ollama and LM Studio to make sure our llama.cpp results match what most users actually see.
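As a rough illustration of what "engine and commit hash next to every result" could look like as data, here is a minimal sketch. The `BenchResult` class, its field names, and the values in `example` are all hypothetical placeholders for illustration, not Runyard's actual schema or a measured result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchResult:
    """Hypothetical record: one published number plus the context needed to reproduce it."""
    model: str
    quant: str
    hardware: str
    engine: str
    engine_commit: str
    tokens_per_sec: float
    estimated: bool = False  # True when extrapolated from similar hardware

    def label(self) -> str:
        # Keep the engine and commit attached to the number wherever it is shown.
        suffix = " (estimated)" if self.estimated else ""
        return (
            f"{self.model} {self.quant} on {self.hardware}: "
            f"{self.tokens_per_sec:.1f} tok/s "
            f"[{self.engine} @ {self.engine_commit}]{suffix}"
        )


# Placeholder values only: the commit and the throughput are not real measurements.
example = BenchResult(
    model="Llama 3.1 8B",
    quant="Q4_K_M",
    hardware="RTX 4090 24 GB",
    engine="llama.cpp",
    engine_commit="0000000",
    tokens_per_sec=42.0,
)
print(example.label())
```

Making the record immutable and carrying the `estimated` flag with it is one way to keep extrapolated numbers from being quoted as measured ones.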

How we measure tokens per second

The tokens-per-second numbers on Runyard measure generation (decode) throughput once prompt processing is complete. They reflect how fast the model streams output to a user during interactive use.

Specifically:

  • Prompt: a fixed 512-token instruction prompt drawn from a small held-out set.
  • Generation: 256 new tokens, greedy decoding, temperature 0.
  • Batch size: 1 (single-user local scenario). Throughput numbers for vLLM at higher batch sizes are labeled separately.
  • Reported number: median of three runs after a one-run warm-up. We also log min/max so outliers are visible.
  • We do not include first-token latency in the tokens/sec number; it is reported separately.
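The reporting rule above (drop the warm-up, take the median of the measured runs, keep min/max) can be sketched as follows. The function name `summarize_runs` and the timings are illustrative assumptions. The timings are invented to make the arithmetic easy to follow, and are assumed to cover only the generation phase.

```python
from statistics import median


def summarize_runs(tokens_generated, run_seconds, warmup_runs=1):
    """Return (median, min, max) tokens/sec, ignoring the first `warmup_runs` runs.

    `run_seconds` holds generation-only wall time per run, in run order.
    """
    measured = run_seconds[warmup_runs:]
    if not measured:
        raise ValueError("no measured runs left after dropping warm-up runs")
    rates = [tokens_generated / seconds for seconds in measured]
    return median(rates), min(rates), max(rates)


# Illustrative timings only: one warm-up run followed by three measured runs,
# each generating 256 tokens.
med, low, high = summarize_runs(256, [5.0, 4.0, 3.2, 2.0])
print(f"median {med:.1f} tok/s (min {low:.1f}, max {high:.1f})")
```

With three measured runs the median is simply the middle value, so a single slow or fast outlier shows up in the min/max but does not move the headline number.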

How we measure VRAM

Every "VRAM required" number on Runyard is the peak resident VRAM during the 256-token generation step. Resident — not reserved. CUDA and Metal both over-reserve, and quoting reserved numbers would lie about real fit.

  • Included: model weights, KV cache at the tested context length, attention scratch buffers, runtime overhead.
  • Excluded: the desktop compositor's VRAM (typically 0.5–1.5 GB on a daily-driver machine — budget for it before downloading).
  • Measured with: nvidia-smi --query-gpu=memory.used on NVIDIA, the metal_resident_memory counter on Apple Silicon, and rocm-smi on AMD.
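For NVIDIA cards, peak usage can be approximated by polling the `nvidia-smi` query above while generation runs and keeping the largest sample. This is a minimal sketch under assumptions, not our exact harness: the helper names, the polling interval, and the 1.5 GB default compositor budget in `fits` are illustrative choices. Polling can miss very short spikes, and `memory.used` covers everything on the GPU, so the sketch assumes nothing else is using the card.

```python
import subprocess
import time


def parse_memory_used_mib(output):
    """Parse csv,noheader,nounits output: one integer MiB value per line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def sample_memory_used_mib(gpu_index=0):
    """Read current memory.used (MiB) for one GPU via nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
            "-i", str(gpu_index),
        ],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used_mib(result.stdout)[0]


def peak_memory_mib(is_running, interval_s=0.1, sampler=sample_memory_used_mib):
    """Poll `sampler` while `is_running()` is true and return the largest sample."""
    peak = 0
    while is_running():
        peak = max(peak, sampler())
        time.sleep(interval_s)
    return peak


def fits(required_gb, card_gb, compositor_gb=1.5):
    """Check fit after reserving headroom for the desktop compositor (assumed budget)."""
    return required_gb + compositor_gb <= card_gb
```

For example, `fits(20.0, 24.0)` is true, while `fits(23.0, 24.0)` is false once 1.5 GB is set aside for the desktop. Passing `sampler` as a parameter also lets the polling loop be exercised without a GPU present.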

What we exclude

We deliberately do not chase numbers that would mislead local users:

  • We do not benchmark with --no-mmap tricks or speculative decoding unless they are the default in the engine's shipping config.
  • We do not report multi-GPU tensor-parallel throughput unless the page is explicitly about multi-GPU.
  • We do not pre-warm caches across runs beyond the single warm-up run described above.
  • We do not test on overclocked or BIOS-tweaked hardware.

Replicating our numbers

Every benchmark post on Runyard ships with the exact command and engine version. As a baseline replication for a GGUF model on llama.cpp:

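# -ngl 999 offloads every layer to the GPU; -p 512 and -n 256 match our prompt and generation lengths; -r 3 repeats each test three times.
# llama-bench reports prompt processing (pp512) and generation (tg256) as separate rows; compare the tg256 row to our tokens/sec.
./llama-bench \
  -m ./models/llama-3.1-8b-q4_k_m.gguf \
  -ngl 999 \
  -p 512 -n 256 \
  -r 3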

If your number is more than ~10% off ours and you are on the same engine, quant, and hardware, that is interesting and we want to know. The most common cause is background apps holding VRAM, but it is sometimes a driver or thermal issue we should flag for other readers.
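The "~10% off" check is a plain relative difference against our published number. A minimal sketch follows; the function names and the example values are illustrative assumptions, and the 10% threshold is treated as a rough guide rather than a hard cutoff.

```python
def relative_difference(yours, reference):
    """Absolute difference between two tok/s figures as a fraction of the reference."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return abs(yours - reference) / reference


def worth_reporting(yours, reference, threshold=0.10):
    """True when your result deviates from the reference by more than `threshold`."""
    return relative_difference(yours, reference) > threshold


# Illustrative values only, not published results.
print(worth_reporting(88.0, 100.0))  # 12% below the reference
print(worth_reporting(95.0, 100.0))  # 5% below the reference
```

The comparison only means something when engine, quant, and hardware match, so it is worth confirming those before treating a gap as a finding.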

Update cadence

We re-benchmark the reference model and hardware sets quarterly, and publish a short changelog when numbers move materially. Inference engines change quickly; a 15 tok/s number from six months ago can be 20 tok/s today on the same hardware after llama.cpp picked up new CUDA kernels.