RUNYARD.DEV / BLOG
Guides, deep-dives, and news on local AI models, hardware, and the Runyard platform.
InclusionAI just open-sourced Ling-2.6-1T under the MIT license: a 1T-parameter MoE with 63B active parameters, a 262K-token context window, and LiveCodeBench scores that beat GPT-5 by 13 points. Here's what it is, how it works, and what hardware you actually need.
Alibaba's newest coding model activates only 3B of its 80B total parameters, and still beats models 10–20x larger on SWE-bench Pro. Here's how to run it locally and why it matters for anyone building with local AI.
MiMo-V2.5 is Xiaomi's fully open-source 310B MoE model with just 15B active parameters per token — built for multimodal agentic coding, long-horizon reasoning, and real-world task completion across text, image, video, and audio.
ChatGPT Plus, Claude Pro, Cursor, Copilot: each is "only" $10–60/month. But stack a few together, run the 3-year math, and consumer GPU hardware starts looking very smart. Here's the honest cost comparison.
At a 1M-token context, DeepSeek V4 uses only 10% of the KV cache that V3.2 needed, by compressing across tokens rather than across heads. Here's a clear breakdown of how Compressed Sparse Attention and Heavily Compressed Attention actually work.
Alibaba's Qwen3.6-27B scores 77.2% on SWE-bench Verified — matching Claude 4.5 Opus on coding tasks — while the sibling 35B-A3B MoE activates just 3B parameters per token. Both run on consumer hardware today.
On March 24, 2026, LiteLLM versions 1.82.7 and 1.82.8 were poisoned with a three-stage credential stealer targeting API keys, SSH keys, cloud credentials, and crypto wallets. Here's exactly what happened and what to do now.