Runyard is a free hardware-aware AI model browser. You enter your CPU, GPU, and VRAM and it instantly shows every local LLM that will run on your machine, ranked by speed and quality.

How much VRAM do I need to run local LLMs?

8GB of VRAM runs 7B models like Llama 3.1 8B and Mistral 7B at Q4 quantization. 16GB unlocks 13B models. 24GB lets you run Mixtral 8x7B and Llama 3 70B at lower quantization.

What is the best local LLM for my GPU?

Use Runyard at www.runyard.dev — enter your GPU and VRAM and the Model Radar will rank every compatible LLM for your exact hardware, showing estimated tokens per second for each model.

Can I run Llama 3 locally?

Yes. Llama 3.1 8B at Q4 runs on any 8GB VRAM GPU. Llama 3.1 70B needs around 40GB VRAM at Q4, or an Apple Silicon Mac with 64GB+ unified memory.

Reference

Local AI glossary.

Plain-English definitions for the terms that come up when running language models locally. From quantization formats (GGUF, AWQ, EXL2) to inference engines (llama.cpp, Ollama, vLLM) to the model architectures themselves.

GPU: Graphics Processing Unit. Originally for rendering graphics; now the standard accelerator for AI inference because its thousands of parallel cores match the matrix math LLMs do.
VRAM: Video RAM — the high-bandwidth memory soldered onto a GPU. The single biggest constraint on which local LLMs you can run, because the entire model usually has to fit inside it.
GGUF: GPT-Generated Unified Format. The current standard file format for quantized LLM weights, designed for llama.cpp. Replaces the older GGML format and stores metadata, tokenizer, and quantized tensors in one file.
AWQ: Activation-aware Weight Quantization. A quantization scheme that protects salient weights based on activation magnitudes. Popular for GPU inference via vLLM and Aphrodite.
EXL2: ExLlamaV2 quantization format. A GPU-only quant format that supports per-tensor bit widths (e.g. 3.5 bpw, 4.65 bpw). Common on consumer NVIDIA GPUs via the ExLlamaV2 engine.
GPTQ: A post-training quantization method that minimizes layerwise reconstruction error. One of the earliest 4-bit schemes, still widely supported but largely superseded by AWQ and EXL2 for GPU and GGUF for CPU/GPU hybrid.
MLX: Apple's array framework for Apple Silicon. The native way to run LLMs on M1/M2/M3/M4 Macs with unified memory. Comparable to PyTorch but designed around the Metal Performance Shaders backend.
Q4_K_M: A 4-bit GGUF quantization variant that uses K-quant blocks with mixed precision inside each block. The most common default for local inference — best balance of size, speed, and quality for most users.
Q5_K_M: A 5-bit GGUF K-quant variant. Slightly larger and slower than Q4_K_M but closer to FP16 quality. Worth it on coding and reasoning workloads when VRAM allows.
Q6_K: A 6-bit GGUF K-quant variant. Near-FP16 quality at a meaningful size reduction. Common choice for users who want to minimize quality loss on a tight VRAM budget.
Q8_0: An 8-bit GGUF quantization with no K-quant grouping. About 99% of FP16 quality at half the size. The conservative default for users who can afford the VRAM.
FP16: Half-precision floating-point (16-bit). The native precision for most LLMs before quantization. A 7B model in FP16 needs roughly 14 GB of memory just for weights.
BF16: Brain Float 16. Same 16-bit width as FP16 but with FP32-range exponent. Preferred for training because it avoids the overflow problems of FP16. Most modern open-source models are released in BF16.
INT8: 8-bit signed integer. A weight precision used by some quantization schemes (e.g. bitsandbytes int8). Roughly half the memory of FP16 with minimal quality loss.
INT4: 4-bit signed integer. The most aggressive widely-used quantization precision. Cuts model size by ~4x vs FP16 with measurable but usually acceptable quality loss.
Tokens per second: The throughput metric for LLM inference. How many tokens the model can generate per second on your hardware. Above 20 tok/s feels interactive; below 10 tok/s feels slow.
KV cache: Key/Value cache. The memory used to store attention keys and values for every token already in the context. Grows linearly with context length and is often the hidden reason long-context runs OOM.
Context length: The maximum number of tokens a model can attend to at once, counting prompt plus generated output. Modern open models range from 8K to 128K+ tokens.
Quantization: The process of compressing model weights from FP16/BF16 down to lower-precision formats (INT8, INT4, or further) to reduce memory and speed up inference. Trades a small amount of quality for a large amount of size.
Inference: Running a trained model to generate output, as opposed to training it. Local LLM tooling is almost entirely about inference.
MoE: Mixture of Experts. A model architecture where only a subset of "expert" sublayers activates per token. Mixtral 8x7B and DeepSeek-V2 are MoE: bigger total parameter counts than they "use" per token.
Active parameters: For MoE models, the number of parameters actually used to predict each token. A 47B MoE may only use 13B active parameters, which is what governs inference speed.
Latency: Time-to-first-token. The delay between sending a prompt and seeing the first output token. Distinct from throughput (tokens/sec), and dominated by prompt processing time for long inputs.
Throughput: Tokens generated per unit time, usually tokens per second. The headline number on most local-LLM benchmarks.
llama.cpp: A C/C++ inference engine for LLMs. The reference implementation for GGUF and the engine under Ollama, LM Studio, GPT4All, Jan, and most local-AI desktop apps.
Ollama: A wrapper around llama.cpp that adds a simple CLI, a model registry, and an OpenAI-compatible HTTP server. The most popular way to run local LLMs.
LM Studio: A desktop application for downloading and chatting with local LLMs, built on llama.cpp. Strong GUI for users who do not want a terminal.
GPT4All: A cross-platform desktop app and model ecosystem for local LLMs, maintained by Nomic. Optimized for running models on CPU as well as GPU.
Jan: An open-source local AI chat app built on llama.cpp, with an OpenAI-compatible API. Often used as a self-hosted alternative to LM Studio.
vLLM: A GPU inference server optimized for high-throughput serving. Implements PagedAttention to manage KV cache like virtual memory. Standard in production deployments.
Aphrodite: An inference engine derived from vLLM with extended quantization support (AWQ, GPTQ, EXL2, GGUF) and additional sampling features. Popular in the open-source LLM hosting community.
Llama: Meta's family of open-weight LLMs (Llama 2, Llama 3, Llama 3.1, Llama 3.2, Llama 3.3). The dominant base architecture for the local-AI ecosystem.
Mistral: Mistral AI's family of open-weight models (Mistral 7B, Mixtral 8x7B/8x22B, Mistral Small/Large). Best-known for strong performance per parameter and an early MoE release.
Qwen: Alibaba's open-weight model family. Qwen 2.5 and Qwen 3 are competitive with frontier closed models on many benchmarks and have permissive licenses.
Gemma: Google's family of open-weight models distilled from the Gemini line. Gemma 2 and Gemma 3 are popular small/medium model choices for local inference.
Phi: Microsoft Research's small-model family, optimized for reasoning per parameter. Phi-3 and Phi-4 are the typical "runs on integrated graphics" picks.
DeepSeek: A Chinese open-weight model family, including the DeepSeek-V2/V3 MoE chat models and DeepSeek-R1 reasoning model. Strong on code and math.
RAG: Retrieval-Augmented Generation. A pattern where a system retrieves relevant documents from a knowledge base and includes them in the LLM prompt, instead of relying on the model's parametric memory alone.
Embedding: A dense vector representation of a piece of text. Used in RAG, semantic search, and clustering. Generated by a dedicated embedding model rather than a generative LLM.
Transformer: The neural network architecture underlying every modern LLM. Built around self-attention, residual connections, and feed-forward layers stacked many times.
Attention: The mechanism by which a transformer mixes information across tokens. Each token computes a weighted sum over every other token's value vector, weighted by query/key similarity.
Multi-head attention: Running attention multiple times in parallel with different learned projections, then concatenating the results. Lets the model attend to different patterns simultaneously.
Flash Attention: An IO-aware attention algorithm that fuses operations and avoids materializing the full attention matrix. Standard for modern GPU inference; faster and uses less VRAM than naive attention.
RoPE: Rotary Position Embedding. A way of encoding token position by rotating query/key vectors in 2D subspaces. Used by Llama, Mistral, Qwen, and most modern open-source LLMs.
SwiGLU: Swish-Gated Linear Unit. The activation function used in the feed-forward layers of Llama and most modern open-source LLMs. Outperforms ReLU and GELU at scale.
RMSNorm: Root Mean Square Layer Normalization. A simpler, slightly faster alternative to LayerNorm. Used by Llama and many of its descendants.
Layer norm: A normalization that scales activations to zero mean and unit variance across the feature dimension. The original normalization used in transformers; largely replaced by RMSNorm in modern open models.
Mixed precision: Running parts of a model in FP16/BF16 and other parts in FP32 for stability. Common during training; less relevant for pure inference but still used in some quantization schemes.
TensorRT-LLM: NVIDIA's production inference engine for LLMs. Compiles a model to a hardware-specific TensorRT engine for maximum throughput on NVIDIA GPUs.
CUDA: NVIDIA's parallel computing platform and API. The default way to run LLM inference on NVIDIA GPUs.
ROCm: AMD's open-source GPU compute stack, the equivalent of CUDA for AMD cards. Supported by llama.cpp, vLLM, and PyTorch with growing maturity.
Metal: Apple's low-level graphics and compute API. The backend llama.cpp and MLX use to run LLMs on Mac GPUs.
MPS: Metal Performance Shaders. Apple's compute primitives layered on Metal. PyTorch's MPS backend uses these to run on Apple Silicon.
NPU: Neural Processing Unit. A dedicated accelerator for AI workloads, increasingly common on consumer laptop SoCs (Qualcomm, AMD Ryzen AI, Intel Core Ultra, Apple). LLM tooling support is still maturing.
BPW: Bits Per Weight. A continuous measure of quantization precision. EXL2 in particular reports BPW values like 3.5 or 4.65 directly instead of discrete Q-levels.

Missing a term? Email hello@runyard.dev and we will add it.