Llama 3.1 is Meta's flagship open-weight model family, competitive with GPT-4o on many benchmarks and completely free to run locally. This guide uses Ollama, the simplest way to get Llama running on Windows, macOS, or Linux.
```bash
# macOS / Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows: download the installer from
# https://ollama.ai/download/windows

# Verify installation
ollama --version
```

```bash
# Pull the 8B model (~4.7GB download)
ollama pull llama3.1:8b

# Start chatting immediately
ollama run llama3.1:8b

# Or pull the 70B model (if you have the hardware)
ollama pull llama3.1:70b
```

The first run downloads the model. Subsequent runs start in under 5 seconds because the model is cached locally. Use Ctrl+D or type /bye to exit the chat.
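Once a model is pulled, you can also drive it programmatically through Ollama's native REST API instead of the interactive CLI. Here is a minimal sketch, assuming the server is running on its default port 11434; the helper names (`build_generate_request`, `generate`) are ours, not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_request(model: str, prompt: str) -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint.

    stream=False asks the server to return a single complete JSON
    object instead of a stream of partial responses.
    """
    payload = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(payload).encode("utf-8")

def generate(model: str, prompt: str) -> str:
    """Send a one-shot prompt to the local Ollama server, return the text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_generate_request(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate('llama3.1:8b', 'Say hello')` returns the model's reply as a plain string.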
Ollama runs a local server on port 11434 with an OpenAI-compatible API. This means you can use it with any tool that supports the OpenAI SDK.
```python
from openai import OpenAI

# Point the OpenAI client at your local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required by the SDK but ignored by Ollama
)

response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Explain transformers in 3 sentences.'},
    ],
)

print(response.choices[0].message.content)
```
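If you enable streaming on Ollama's native API instead, the server sends one JSON object per line, each carrying a `response` fragment, with `done: true` on the final object. A sketch of accumulating those fragments (the sample lines below are illustrative values in that documented shape, not captured server output):

```python
import json
from typing import Iterable

def collect_stream(lines: Iterable[str]) -> str:
    """Concatenate the 'response' fragments from Ollama's
    newline-delimited JSON stream, stopping at the 'done' object."""
    parts = []
    for line in lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Illustrative fragments (not real server output):
sample = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world!", "done": false}',
    '{"response": "", "done": true}',
]
print(collect_stream(sample))  # -> Hello, world!
```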
Open WebUI provides a full ChatGPT-like interface on top of Ollama:

```bash
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Open http://localhost:3000 in your browser.

Not sure which Llama variant to pick for your hardware? Visit runyard.dev: enter your GPU and VRAM, and the Model Radar will instantly recommend the right Llama 3.1 size and quantization for your setup.
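As a rough rule of thumb (our own back-of-the-envelope arithmetic, not an official sizing guide), a quantized model needs about parameters × bits-per-weight ÷ 8 bytes of memory for weights, plus headroom for the KV cache and runtime. The 20% overhead factor below is an assumption:

```python
def approx_model_gb(params_billions: float, bits_per_weight: float,
                    overhead: float = 1.2) -> float:
    """Crude VRAM estimate: weight bytes at the given quantization
    width, padded ~20% for KV cache and runtime (heuristic, not exact)."""
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

# 8B model at ~4.5 bits/weight (Q4_K_M-style quantization):
print(approx_model_gb(8, 4.5))   # -> 5.4
# 70B at the same quantization needs a high-VRAM or multi-GPU setup:
print(approx_model_gb(70, 4.5))  # -> 47.2
```

The 8B estimate lines up with the ~4.7GB download noted above; the 70B figure explains the "if you have the hardware" caveat.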