Reference

Local AI glossary.

Plain-English definitions for the terms that come up when running language models locally. From quantization formats (GGUF, AWQ, EXL2) to inference engines (llama.cpp, Ollama, vLLM) to the model architectures themselves.

GPU
Graphics Processing Unit. Originally for rendering graphics; now the standard accelerator for AI inference because its thousands of parallel cores match the matrix math LLMs do.
VRAM
Video RAM — the high-bandwidth memory soldered onto a GPU. The single biggest constraint on which local LLMs you can run, because the entire model usually has to fit inside it.
GGUF
GPT-Generated Unified Format. The current standard file format for quantized LLM weights, designed for llama.cpp. Replaces the older GGML format and stores metadata, tokenizer, and quantized tensors in one file.
AWQ
Activation-aware Weight Quantization. A quantization scheme that protects salient weights based on activation magnitudes. Popular for GPU inference via vLLM and Aphrodite.
EXL2
ExLlamaV2 quantization format. A GPU-only quant format that supports per-tensor bit widths (e.g. 3.5 bpw, 4.65 bpw). Common on consumer NVIDIA GPUs via the ExLlamaV2 engine.
GPTQ
A post-training quantization method that minimizes layerwise reconstruction error. One of the earliest 4-bit schemes, still widely supported but largely superseded by AWQ and EXL2 for GPU and GGUF for CPU/GPU hybrid.
MLX
Apple's array framework for Apple Silicon. The native way to run LLMs on M1/M2/M3/M4 Macs with unified memory. Comparable to PyTorch but designed around the Metal Performance Shaders backend.
Q4_K_M
A 4-bit GGUF quantization variant that uses K-quant blocks with mixed precision inside each block. The most common default for local inference — best balance of size, speed, and quality for most users.
Q5_K_M
A 5-bit GGUF K-quant variant. Slightly larger and slower than Q4_K_M but closer to FP16 quality. Worth it on coding and reasoning workloads when VRAM allows.
Q6_K
A 6-bit GGUF K-quant variant. Near-FP16 quality at a meaningful size reduction. Common choice for users who want to minimize quality loss on a tight VRAM budget.
Q8_0
An 8-bit GGUF quantization with no K-quant grouping. About 99% of FP16 quality at half the size. The conservative default for users who can afford the VRAM.
FP16
Half-precision floating-point (16-bit). The native precision for most LLMs before quantization. A 7B model in FP16 needs roughly 14 GB of memory just for weights.
BF16
Brain Float 16. Same 16-bit width as FP16 but with FP32-range exponent. Preferred for training because it avoids the overflow problems of FP16. Most modern open-source models are released in BF16.
INT8
8-bit signed integer. A weight precision used by some quantization schemes (e.g. bitsandbytes int8). Roughly half the memory of FP16 with minimal quality loss.
INT4
4-bit signed integer. The most aggressive widely-used quantization precision. Cuts model size by ~4x vs FP16 with measurable but usually acceptable quality loss.
Tokens per second
The throughput metric for LLM inference. How many tokens the model can generate per second on your hardware. Above 20 tok/s feels interactive; below 10 tok/s feels slow.
KV cache
Key/Value cache. The memory used to store attention keys and values for every token already in the context. Grows linearly with context length and is often the hidden reason long-context runs OOM.
Context length
The maximum number of tokens a model can attend to at once, counting prompt plus generated output. Modern open models range from 8K to 128K+ tokens.
Quantization
The process of compressing model weights from FP16/BF16 down to lower-precision formats (INT8, INT4, or further) to reduce memory and speed up inference. Trades a small amount of quality for a large amount of size.
Inference
Running a trained model to generate output, as opposed to training it. Local LLM tooling is almost entirely about inference.
MoE
Mixture of Experts. A model architecture where only a subset of "expert" sublayers activates per token. Mixtral 8x7B and DeepSeek-V2 are MoE: bigger total parameter counts than they "use" per token.
Active parameters
For MoE models, the number of parameters actually used to predict each token. A 47B MoE may only use 13B active parameters, which is what governs inference speed.
Latency
Time-to-first-token. The delay between sending a prompt and seeing the first output token. Distinct from throughput (tokens/sec), and dominated by prompt processing time for long inputs.
Throughput
Tokens generated per unit time, usually tokens per second. The headline number on most local-LLM benchmarks.
llama.cpp
A C/C++ inference engine for LLMs. The reference implementation for GGUF and the engine under Ollama, LM Studio, GPT4All, Jan, and most local-AI desktop apps.
Ollama
A wrapper around llama.cpp that adds a simple CLI, a model registry, and an OpenAI-compatible HTTP server. The most popular way to run local LLMs.
LM Studio
A desktop application for downloading and chatting with local LLMs, built on llama.cpp. Strong GUI for users who do not want a terminal.
GPT4All
A cross-platform desktop app and model ecosystem for local LLMs, maintained by Nomic. Optimized for running models on CPU as well as GPU.
Jan
An open-source local AI chat app built on llama.cpp, with an OpenAI-compatible API. Often used as a self-hosted alternative to LM Studio.
vLLM
A GPU inference server optimized for high-throughput serving. Implements PagedAttention to manage KV cache like virtual memory. Standard in production deployments.
Aphrodite
An inference engine derived from vLLM with extended quantization support (AWQ, GPTQ, EXL2, GGUF) and additional sampling features. Popular in the open-source LLM hosting community.
Llama
Meta's family of open-weight LLMs (Llama 2, Llama 3, Llama 3.1, Llama 3.2, Llama 3.3). The dominant base architecture for the local-AI ecosystem.
Mistral
Mistral AI's family of open-weight models (Mistral 7B, Mixtral 8x7B/8x22B, Mistral Small/Large). Best-known for strong performance per parameter and an early MoE release.
Qwen
Alibaba's open-weight model family. Qwen 2.5 and Qwen 3 are competitive with frontier closed models on many benchmarks and have permissive licenses.
Gemma
Google's family of open-weight models distilled from the Gemini line. Gemma 2 and Gemma 3 are popular small/medium model choices for local inference.
Phi
Microsoft Research's small-model family, optimized for reasoning per parameter. Phi-3 and Phi-4 are the typical "runs on integrated graphics" picks.
DeepSeek
A Chinese open-weight model family, including the DeepSeek-V2/V3 MoE chat models and DeepSeek-R1 reasoning model. Strong on code and math.
RAG
Retrieval-Augmented Generation. A pattern where a system retrieves relevant documents from a knowledge base and includes them in the LLM prompt, instead of relying on the model's parametric memory alone.
Embedding
A dense vector representation of a piece of text. Used in RAG, semantic search, and clustering. Generated by a dedicated embedding model rather than a generative LLM.
Transformer
The neural network architecture underlying every modern LLM. Built around self-attention, residual connections, and feed-forward layers stacked many times.
Attention
The mechanism by which a transformer mixes information across tokens. Each token computes a weighted sum over every other token's value vector, weighted by query/key similarity.
Multi-head attention
Running attention multiple times in parallel with different learned projections, then concatenating the results. Lets the model attend to different patterns simultaneously.
Flash Attention
An IO-aware attention algorithm that fuses operations and avoids materializing the full attention matrix. Standard for modern GPU inference; faster and uses less VRAM than naive attention.
RoPE
Rotary Position Embedding. A way of encoding token position by rotating query/key vectors in 2D subspaces. Used by Llama, Mistral, Qwen, and most modern open-source LLMs.
SwiGLU
Swish-Gated Linear Unit. The activation function used in the feed-forward layers of Llama and most modern open-source LLMs. Outperforms ReLU and GELU at scale.
RMSNorm
Root Mean Square Layer Normalization. A simpler, slightly faster alternative to LayerNorm. Used by Llama and many of its descendants.
Layer norm
A normalization that scales activations to zero mean and unit variance across the feature dimension. The original normalization used in transformers; largely replaced by RMSNorm in modern open models.
Mixed precision
Running parts of a model in FP16/BF16 and other parts in FP32 for stability. Common during training; less relevant for pure inference but still used in some quantization schemes.
TensorRT-LLM
NVIDIA's production inference engine for LLMs. Compiles a model to a hardware-specific TensorRT engine for maximum throughput on NVIDIA GPUs.
CUDA
NVIDIA's parallel computing platform and API. The default way to run LLM inference on NVIDIA GPUs.
ROCm
AMD's open-source GPU compute stack, the equivalent of CUDA for AMD cards. Supported by llama.cpp, vLLM, and PyTorch with growing maturity.
Metal
Apple's low-level graphics and compute API. The backend llama.cpp and MLX use to run LLMs on Mac GPUs.
MPS
Metal Performance Shaders. Apple's compute primitives layered on Metal. PyTorch's MPS backend uses these to run on Apple Silicon.
NPU
Neural Processing Unit. A dedicated accelerator for AI workloads, increasingly common on consumer laptop SoCs (Qualcomm, AMD Ryzen AI, Intel Core Ultra, Apple). LLM tooling support is still maturing.
BPW
Bits Per Weight. A continuous measure of quantization precision. EXL2 in particular reports BPW values like 3.5 or 4.65 directly instead of discrete Q-levels.

Missing a term? Email hello@runyard.dev and we will add it.