Reference
Local AI glossary.
Plain-English definitions for the terms that come up when running language models locally, from quantization formats (GGUF, AWQ, EXL2) to inference engines (llama.cpp, Ollama, vLLM) to the model architectures themselves.
- GPU
- Graphics Processing Unit. Originally for rendering graphics; now the standard accelerator for AI inference because its thousands of parallel cores match the matrix math LLMs do.
- VRAM
- Video RAM: the high-bandwidth memory soldered onto a GPU. The single biggest constraint on which local LLMs you can run, because the model weights and the KV cache usually have to fit inside it to run at full GPU speed.
- GGUF
- GPT-Generated Unified Format. The current standard file format for quantized LLM weights, designed for llama.cpp. Replaces the older GGML format and stores metadata, tokenizer, and quantized tensors in one file.
- AWQ
- Activation-aware Weight Quantization. A quantization scheme that protects salient weights based on activation magnitudes. Popular for GPU inference via vLLM and Aphrodite.
- EXL2
- ExLlamaV2 quantization format. A GPU-only quant format that mixes bit widths across a model's tensors to hit a fractional average bitrate (e.g. 3.5 bpw, 4.65 bpw). Common on consumer NVIDIA GPUs via the ExLlamaV2 engine.
- GPTQ
- A post-training quantization method that minimizes layerwise reconstruction error. One of the earliest 4-bit schemes, still widely supported but largely superseded by AWQ and EXL2 for GPU inference and by GGUF for hybrid CPU/GPU inference.
- MLX
- Apple's array framework for Apple Silicon. The native way to run LLMs on M1/M2/M3/M4 Macs with unified memory. Comparable to PyTorch in scope, but designed around Apple's Metal API and unified memory.
- Q4_K_M
- A 4-bit GGUF quantization variant. It uses K-quant super-blocks and, in the "M" (medium) mix, keeps a few sensitive tensors at higher precision. The most common default for local inference: the best balance of size, speed, and quality for most users.
- Q5_K_M
- A 5-bit GGUF K-quant variant. Slightly larger and slower than Q4_K_M but closer to FP16 quality. Worth it on coding and reasoning workloads when VRAM allows.
- Q6_K
- A 6-bit GGUF K-quant variant. Near-FP16 quality at a meaningful size reduction. Common choice for users who want to minimize quality loss on a tight VRAM budget.
- Q8_0
- An 8-bit GGUF quantization with no K-quant grouping. About 99% of FP16 quality at half the size. The conservative default for users who can afford the VRAM.
- FP16
- Half-precision floating-point (16-bit). The native precision for most LLMs before quantization. A 7B model in FP16 needs roughly 14 GB of memory just for weights.
- BF16
- Brain Float 16. Same 16-bit width as FP16 but with the same 8-bit exponent as FP32, so it covers FP32's numeric range at lower precision. Preferred for training because it avoids the overflow problems of FP16. Most modern open-source models are released in BF16.
- INT8
- 8-bit signed integer. A weight precision used by some quantization schemes (e.g. bitsandbytes int8). Roughly half the memory of FP16 with minimal quality loss.
- INT4
- 4-bit signed integer. The most aggressive widely-used quantization precision. Cuts model size by ~4x vs FP16 with measurable but usually acceptable quality loss.
- Tokens per second
- The throughput metric for LLM inference. How many tokens the model can generate per second on your hardware. Above 20 tok/s feels interactive; below 10 tok/s feels slow.
- KV cache
- Key/Value cache. The memory used to store attention keys and values for every token already in the context. Grows linearly with context length and is often the hidden reason long-context runs OOM.
- Context length
- The maximum number of tokens a model can attend to at once, counting prompt plus generated output. Modern open models range from 8K to 128K+ tokens.
- Quantization
- The process of compressing model weights from FP16/BF16 down to lower-precision formats (INT8, INT4, or further) to reduce memory and speed up inference. Trades a small amount of quality for a large amount of size.
- Inference
- Running a trained model to generate output, as opposed to training it. Local LLM tooling is almost entirely about inference.
- MoE
- Mixture of Experts. A model architecture where only a subset of "expert" sublayers activates per token. Mixtral 8x7B and DeepSeek-V2 are MoE models: their total parameter counts are much larger than the number of parameters they use per token.
- Active parameters
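- For MoE models, the number of parameters actually used to predict each token. A 47B MoE may use only 13B active parameters per token, which is what governs inference speed. Memory use is still set by the total parameter count, because every expert has to be loaded.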
- Latency
- In local-LLM benchmarks, usually measured as time-to-first-token: the delay between sending a prompt and seeing the first output token. Distinct from throughput (tokens/sec), and dominated by prompt processing time for long inputs.
- Throughput
- Tokens generated per unit time, usually tokens per second. The headline number on most local-LLM benchmarks.
- llama.cpp
- A C/C++ inference engine for LLMs. The reference implementation for GGUF and the engine under Ollama, LM Studio, GPT4All, Jan, and most local-AI desktop apps.
- Ollama
- A wrapper around llama.cpp that adds a simple CLI, a model registry, and an OpenAI-compatible HTTP server. The most popular way to run local LLMs.
- LM Studio
- A desktop application for downloading and chatting with local LLMs, built on llama.cpp. Strong GUI for users who do not want a terminal.
- GPT4All
- A cross-platform desktop app and model ecosystem for local LLMs, maintained by Nomic. Optimized for running models on CPU as well as GPU.
- Jan
- An open-source local AI chat app built on llama.cpp, with an OpenAI-compatible API. Often used as a self-hosted alternative to LM Studio.
- vLLM
- A GPU inference server optimized for high-throughput serving. Implements PagedAttention to manage KV cache like virtual memory. Standard in production deployments.
- Aphrodite
- An inference engine derived from vLLM with extended quantization support (AWQ, GPTQ, EXL2, GGUF) and additional sampling features. Popular in the open-source LLM hosting community.
- Llama
- Meta's family of open-weight LLMs (Llama 2, Llama 3, Llama 3.1, Llama 3.2, Llama 3.3). The dominant base architecture for the local-AI ecosystem.
- Mistral
- Mistral AI's family of open-weight models (Mistral 7B, Mixtral 8x7B/8x22B, Mistral Small/Large). Best-known for strong performance per parameter and an early MoE release.
- Qwen
- Alibaba's open-weight model family. Qwen 2.5 and Qwen 3 are competitive with frontier closed models on many benchmarks and have permissive licenses.
- Gemma
- Google's family of open-weight models, built from the same research and technology as the Gemini line. Gemma 2 and Gemma 3 are popular small/medium model choices for local inference.
- Phi
- Microsoft Research's small-model family, optimized for reasoning per parameter. Phi-3 and Phi-4 are the typical "runs on integrated graphics" picks.
- DeepSeek
- A Chinese open-weight model family, including the DeepSeek-V2/V3 MoE chat models and DeepSeek-R1 reasoning model. Strong on code and math.
- RAG
- Retrieval-Augmented Generation. A pattern where a system retrieves relevant documents from a knowledge base and includes them in the LLM prompt, instead of relying on the model's parametric memory alone.
- Embedding
- A dense vector representation of a piece of text. Used in RAG, semantic search, and clustering. Generated by a dedicated embedding model rather than a generative LLM.
- Transformer
- The neural network architecture underlying nearly every modern LLM. Built around self-attention, residual connections, and feed-forward layers stacked many times.
- Attention
- The mechanism by which a transformer mixes information across tokens. Each token computes a weighted sum over the value vectors of the tokens it can attend to (in decoder-only LLMs, itself and every earlier token), weighted by query/key similarity.
- Multi-head attention
- Running attention multiple times in parallel with different learned projections, then concatenating the results. Lets the model attend to different patterns simultaneously.
- Flash Attention
- An IO-aware attention algorithm that fuses operations and avoids materializing the full attention matrix. Standard for modern GPU inference; faster and uses less VRAM than naive attention.
- RoPE
- Rotary Position Embedding. A way of encoding token position by rotating query/key vectors in 2D subspaces. Used by Llama, Mistral, Qwen, and most modern open-source LLMs.
- SwiGLU
- Swish-Gated Linear Unit. The activation function used in the feed-forward layers of Llama and most modern open-source LLMs. Outperforms ReLU and GELU at scale.
- RMSNorm
- Root Mean Square Layer Normalization. A simpler, slightly faster alternative to LayerNorm. Used by Llama and many of its descendants.
- Layer norm
- A normalization that centers and rescales activations to zero mean and unit variance across the feature dimension. The original normalization used in transformers; largely replaced by RMSNorm in modern open models.
- Mixed precision
- Running parts of a model in FP16/BF16 and other parts in FP32 for stability. Common during training; less relevant for pure inference but still used in some quantization schemes.
- TensorRT-LLM
- NVIDIA's production inference engine for LLMs. Compiles a model to a hardware-specific TensorRT engine for maximum throughput on NVIDIA GPUs.
- CUDA
- NVIDIA's parallel computing platform and API. The default way to run LLM inference on NVIDIA GPUs.
- ROCm
- AMD's open-source GPU compute stack, the equivalent of CUDA for AMD cards. Supported by llama.cpp, vLLM, and PyTorch with growing maturity.
- Metal
- Apple's low-level graphics and compute API. The backend llama.cpp and MLX use to run LLMs on Mac GPUs.
- MPS
- Metal Performance Shaders. Apple's compute primitives layered on Metal. PyTorch's MPS backend uses these to run on Apple Silicon.
- NPU
- Neural Processing Unit. A dedicated accelerator for AI workloads, increasingly common on consumer laptop SoCs (Qualcomm, AMD Ryzen AI, Intel Core Ultra, Apple). LLM tooling support is still maturing.
- BPW
- Bits Per Weight. A continuous measure of quantization precision. EXL2 in particular reports BPW values like 3.5 or 4.65 directly instead of discrete Q-levels.
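To make the FP16, INT8, INT4, and BPW entries concrete, here is a minimal sketch of the weight-memory arithmetic behind them. The function name `weight_memory_gb` is made up for this example, and the estimate covers weights only: it ignores the KV cache, activations, and the per-block scale overhead that real quantized formats add.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Hypothetical helper: weights only, no KV cache or format overhead.
    """
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # A 7B-parameter model at several precisions, including a fractional BPW.
    for label, bpw in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("4.65 bpw", 4.65)]:
        print(f"{label:>8}: {weight_memory_gb(7e9, bpw):.1f} GB")
```

For a 7B model at 16 bits per weight this gives the roughly 14 GB quoted in the FP16 entry. Halving the bit width halves the estimate.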
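The KV cache entry says the cache grows linearly with context length. A rough sketch of that arithmetic follows. The helper `kv_cache_bytes` and the example shape (32 layers, 8 key/value heads, head dimension 128, 2-byte values) are illustrative assumptions, not figures from this glossary, and the estimate ignores batching, cache quantization, and engine-specific overhead.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumed FP16/BF16 cache entries
) -> int:
    """Approximate KV cache size for one sequence (hypothetical helper)."""
    # The leading 2 accounts for one key and one value vector per head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value


if __name__ == "__main__":
    # Illustrative model shape; substitute the values from your model's config.
    for tokens in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(32, 8, 128, tokens) / 2**30
        print(f"{tokens:>7} tokens: {gib:.1f} GiB")
```

With this example shape the cache is 1 GiB at 8,192 tokens and 16 GiB at 131,072 tokens, which is why a model whose weights fit comfortably can still run out of memory on a long prompt.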
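The Attention entry describes a weighted sum of value vectors, weighted by query/key similarity. The sketch below is a plain-Python, single-head version of scaled dot-product attention. It assumes a causal mask (each token sees only itself and earlier tokens, as in decoder-only LLMs) and omits the learned projections; the function names are chosen for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Token i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    q = k = [[1.0, 0.0], [0.0, 1.0]]
    v = [[1.0, 2.0], [3.0, 4.0]]
    print(causal_attention(q, k, v))
```

Real engines compute the same quantity with batched matrix multiplies, and multi-head attention runs several such heads in parallel on different projections.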
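The RoPE entry describes rotating query/key vectors in 2D subspaces. A minimal sketch of that rotation is below. It assumes the adjacent-pair convention (dimensions 0 and 1 form one pair, 2 and 3 the next) and a frequency base of 10000; some implementations pair dimensions differently, and the function name `rope` is just for this example.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Assumes an even-length vector and the adjacent-pair convention.
    """
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        # Lower dimensions rotate quickly, higher dimensions slowly.
        angle = position * base ** (-i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.25]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # Same relative offset (2) at different absolute positions.
    print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

The two printed dot products agree up to floating-point error: after rotation, the query/key score depends only on the distance between the two positions, not on where they sit in the sequence.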
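The RMSNorm and Layer norm entries differ in one step: layer norm subtracts the mean before rescaling, while RMSNorm only rescales. The sketch below shows both on a single feature vector. It leaves out the learned gain and bias parameters that real layers apply afterwards, and the `eps` default is an assumed value.

```python
import math


def layer_norm(x, eps=1e-5):
    """Center to zero mean, then scale to unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Scale by the root mean square only; the mean is left untouched."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    print("layer_norm:", layer_norm(x))
    print("rms_norm:  ", rms_norm(x))
```

Skipping the mean computation is what makes RMSNorm the simpler and slightly cheaper of the two.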
Missing a term? Email hello@runyard.dev and we will add it.