Runyard Team (@runyard_dev) · 10 min read

Tags

#llama3 #ollama #local-llm #guide #setup

How to Run Llama 3 Locally: Complete Step-by-Step Guide

Llama 3.1 is Meta's best open-source model family — competitive with GPT-4o on many benchmarks and completely free to run locally. This guide uses Ollama, the simplest way to get Llama running on Windows, macOS, or Linux.

Step 1: Pick Your Model Size

  • Llama 3.1 8B — 8GB VRAM. Best for most users. Fast, capable, general-purpose.
  • Llama 3.1 70B — 40GB VRAM at Q4. Significantly better at reasoning and coding.
  • Llama 3.1 405B — 200GB+ VRAM. Research/enterprise grade. Most people cannot run this.
  • Llama 3.2 3B — 4GB VRAM. For very limited hardware or embedded use.
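The VRAM figures above can be sanity-checked with a rough rule of thumb: a quantized model needs about parameter count × bits per weight ÷ 8 gigabytes for the weights, plus 1–2 GB of overhead for the KV cache and runtime (the overhead grows with context length, so treat these as estimates, not exact requirements). A minimal sketch:

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate: weight storage plus KV-cache/runtime overhead."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8 bits ≈ 1 GB
    return round(weight_gb + overhead_gb, 1)

# Llama 3.1 8B at Q4 (~4 bits per weight)
print(estimate_vram_gb(8, 4))   # → 5.5
# Llama 3.1 70B at Q4
print(estimate_vram_gb(70, 4))  # → 36.5
```

This is why the 70B model lands around the 40 GB mark at Q4: 35 GB of weights plus cache overhead.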

Step 2: Install Ollama

terminal (bash)
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows — download the installer from:
# https://ollama.ai/download/windows

# Verify installation
ollama --version
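After installation, Ollama also starts a background server on port 11434 that answers its root path with a plain-text banner. A quick way to confirm it's up from Python (a sketch assuming the default port):

```python
import urllib.request

def ollama_is_up(host: str = 'http://localhost:11434') -> bool:
    """True if the Ollama server responds with its 'Ollama is running' banner."""
    try:
        with urllib.request.urlopen(host, timeout=2) as resp:
            return resp.read().decode().strip() == 'Ollama is running'
    except OSError:  # connection refused, timeout, DNS failure, etc.
        return False
```

Call `ollama_is_up()` before scripting against the API to fail fast with a clear message instead of a stack trace.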

Step 3: Pull and Run Llama 3.1

terminal (bash)
# Pull the 8B model (~4.7GB download)
ollama pull llama3.1:8b

# Start chatting immediately
ollama run llama3.1:8b

# Or pull the 70B model (if you have the hardware)
ollama pull llama3.1:70b

The first run downloads the model. Subsequent runs start in under 5 seconds because the model is cached locally. Use Ctrl+D or type /bye to exit the chat.

Step 4: Use It as an API

Ollama runs a local server on port 11434 with an OpenAI-compatible API. This means you can use it with any tool that supports the OpenAI SDK.

chat.py (Python)
from openai import OpenAI

# Point the OpenAI client at your local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required but ignored
)

response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain transformers in 3 sentences.'},
    ]
)

print(response.choices[0].message.content)
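Besides the OpenAI-compatible endpoint, Ollama exposes a native API at /api/generate that streams one JSON object per line as tokens are generated. Here is a stdlib-only sketch of streaming a response (endpoint and field names per Ollama's API; the prompt is just an example):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = 'llama3.1:8b') -> dict:
    """Request body for Ollama's native /api/generate endpoint."""
    return {'model': model, 'prompt': prompt, 'stream': True}

def stream_generate(prompt: str, model: str = 'llama3.1:8b',
                    host: str = 'http://localhost:11434') -> None:
    req = urllib.request.Request(
        f'{host}/api/generate',
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={'Content-Type': 'application/json'},
    )
    # With stream=True, the server sends newline-delimited JSON chunks;
    # each carries a 'response' fragment, and the last has 'done': true.
    with urllib.request.urlopen(req) as resp:
        for line in resp:
            chunk = json.loads(line)
            print(chunk.get('response', ''), end='', flush=True)
            if chunk.get('done'):
                break
    print()
```

Run `stream_generate('Explain transformers in 3 sentences.')` with the server up and tokens print as they arrive, rather than after the full completion.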

Step 5: Add a Chat UI (Optional)

terminal (bash)
# Open WebUI — full ChatGPT-like interface for Ollama
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main

# Open http://localhost:3000 in your browser

Troubleshooting Common Issues

  • Model runs on CPU (very slow) — Check nvidia-smi to confirm GPU is detected. Reinstall CUDA drivers if needed.
  • Out of memory error — Switch to a smaller quantization: ollama pull llama3.1:8b-q4_0
  • Slow first response — Normal. First response loads model into VRAM. Subsequent responses are fast.
  • Port 11434 already in use — Another Ollama instance is running. Kill it with: pkill ollama
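When debugging, it also helps to confirm which models are actually on disk. Ollama's /api/tags endpoint lists local models; a small stdlib sketch:

```python
import json
import urllib.request

def extract_names(tags_json: dict) -> list:
    """Pull model names out of the JSON structure returned by /api/tags."""
    return [m['name'] for m in tags_json.get('models', [])]

def list_local_models(host: str = 'http://localhost:11434') -> list:
    """Ask the running Ollama server which models have been pulled."""
    with urllib.request.urlopen(f'{host}/api/tags') as resp:
        return extract_names(json.load(resp))
```

If `list_local_models()` doesn't show the model you expect, the pull failed or went to a different Ollama instance.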

Not sure which Llama variant to pick for your hardware? Visit runyard.dev — enter your GPU and VRAM, and the Model Radar will recommend the right Llama 3.1 size and quantization for your setup instantly.

RUNYARD.DEV

Hardware-aware AI model discovery. Know exactly what runs on your machine — before you download.

© 2026 RUNYARD.DEV — All rights reserved.

Built for local AI.
