Runyard Team (@runyard_dev) · 10 min read

Tags

#llama3 #ollama #local-llm #guide #setup

How to Run Llama 3 Locally: Complete Step-by-Step Guide

Llama 3.1 is Meta's best open-source model family — competitive with GPT-4o on many benchmarks and completely free to run locally. This guide uses Ollama, the simplest way to get Llama running on Windows, macOS, or Linux.

Step 1: Pick Your Model Size

  • Llama 3.1 8B — 8GB VRAM. Best for most users. Fast, capable, general-purpose.
  • Llama 3.1 70B — 40GB VRAM at Q4. Significantly better at reasoning and coding.
  • Llama 3.1 405B — 200GB+ VRAM. Research/enterprise grade. Most people cannot run this.
  • Llama 3.2 3B — 4GB VRAM. For very limited hardware or embedded use.
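The VRAM figures above can be sanity-checked with a rough rule of thumb: a quantized model needs about parameter count × bits per weight ÷ 8 gigabytes for the weights, plus 1–2 GB of overhead for the KV cache and runtime (the overhead grows with context length, so treat these as estimates, not exact requirements). A minimal sketch:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weight storage plus KV-cache/runtime overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ≈ 1 GB
    return round(weight_gb + overhead_gb, 1)

# Llama 3.1 8B at Q4 (~4 bits per weight)
print(estimate_vram_gb(8, 4))   # → 5.5
# Llama 3.1 70B at Q4
print(estimate_vram_gb(70, 4))  # → 36.5
```

This is why the 70B model lands around the 40 GB mark at Q4: 35 GB of weights plus cache overhead.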

Step 2: Install Ollama

terminal (bash)
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download the installer from:
# https://ollama.ai/download/windows

# Verify installation
ollama --version
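After installation, Ollama also starts a background server on port 11434 that answers its root path with a plain-text banner. A quick way to confirm it's up from Python (a sketch assuming the default port):

```python
import urllib.request

def ollama_is_up(host: str = 'http://localhost:11434') -> bool:
    """True if the Ollama server responds with its 'Ollama is running' banner."""
    try:
        with urllib.request.urlopen(host, timeout=2) as resp:
            return resp.read().decode().strip() == 'Ollama is running'
    except OSError:  # connection refused, timeout, DNS failure, etc.
        return False
```

Call `ollama_is_up()` before scripting against the API to fail fast with a clear message instead of a stack trace.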

Step 3: Pull and Run Llama 3.1

terminal (bash)
# Pull the 8B model (~4.7GB download)
ollama pull llama3.1:8b

# Start chatting immediately
ollama run llama3.1:8b

# Or pull the 70B model (if you have the hardware)
ollama pull llama3.1:70b

The first run downloads the model. Subsequent runs start in under 5 seconds because the model is cached locally. Use Ctrl+D or type /bye to exit the chat.

Step 4: Use It as an API

Ollama runs a local server on port 11434 with an OpenAI-compatible API. This means you can use it with any tool that supports the OpenAI SDK.

chat.py (Python)
from openai import OpenAI

# Point the OpenAI client at your local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain transformers in 3 sentences.'},
    ]
)

print(response.choices[0].message.content)
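Besides the OpenAI-compatible endpoint, Ollama exposes a native API at /api/generate that streams one JSON object per line as tokens are generated. Here is a stdlib-only sketch of streaming a response (endpoint and field names per Ollama's API; the prompt is just an example):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = 'llama3.1:8b') -> dict:
    """Request body for Ollama's native /api/generate endpoint."""
    return {'model': model, 'prompt': prompt, 'stream': True}

def stream_generate(prompt: str, model: str = 'llama3.1:8b',
                    host: str = 'http://localhost:11434') -> None:
    req = urllib.request.Request(
        f'{host}/api/generate',
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={'Content-Type': 'application/json'},
    )
    # With stream=True, the server sends newline-delimited JSON chunks;
    # each carries a 'response' fragment, and the last has 'done': true.
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = json.loads(line)
            print(chunk.get('response', ''), end='', flush=True)
            if chunk.get('done'):
                break
    print()
```

Run `stream_generate('Explain transformers in 3 sentences.')` with the server up and tokens print as they arrive, rather than after the full completion.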

Step 5: Add a Chat UI (Optional)

terminal (bash)
# Open WebUI — full ChatGPT-like interface for Ollama
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# Open http://localhost:3000 in your browser

Troubleshooting Common Issues

  • Model runs on CPU (very slow) — Check nvidia-smi to confirm GPU is detected. Reinstall CUDA drivers if needed.
  • Out of memory error — Switch to a smaller quantization: ollama pull llama3.1:8b-q4_0
  • Slow first response — Normal. First response loads model into VRAM. Subsequent responses are fast.
  • Port 11434 already in use — Another Ollama instance is running. Kill it with: pkill ollama
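When debugging, it also helps to confirm which models are actually on disk. Ollama's /api/tags endpoint lists local models; a small stdlib sketch:

```python
import json
import urllib.request

def extract_names(tags_json: dict) -> list:
    """Pull model names out of the JSON structure returned by /api/tags."""
    return [m['name'] for m in tags_json.get('models', [])]

def list_local_models(host: str = 'http://localhost:11434') -> list:
    """Ask the running Ollama server which models have been pulled."""
    with urllib.request.urlopen(f'{host}/api/tags') as resp:
        return extract_names(json.load(resp))
```

If `list_local_models()` doesn't show the model you expect, the pull failed or went to a different Ollama instance.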

Not sure which Llama variant to pick for your hardware? Visit runyard.dev — enter your GPU and VRAM, and the Model Radar will recommend the right Llama 3.1 size and quantization for your setup instantly.

RUNYARD.DEV

Hardware-aware AI model discovery. Know exactly what runs on your machine — before you download.

© 2026 RUNYARD.DEV — All rights reserved.

Built for local AI.
