
Block Diffusion and DFlash: The Two Ideas Making Local LLMs 6x Faster

Figure: abstract visualization of parallel token generation streams in a diffusion language model. Block Diffusion generates multiple tokens simultaneously within each block; DFlash uses that property to make speculative decoding dramatically faster.

Every local LLM you run today generates tokens one at a time, left to right, waiting for each one before producing the next. It's how GPT-2 worked in 2019 and it's how Llama 3.3 works today. Two research results, Block Diffusion (ICLR 2025 Oral) and DFlash (February 2026), attack that bottleneck from different angles and together deliver something remarkable: up to 6x lossless acceleration, with output quality identical to standard decoding. If you run models locally and care about tokens per second, these are the papers that matter right now.

The Problem: Autoregressive Generation Is Inherently Sequential

Standard language models are autoregressive (AR): to generate token N, they must have already generated tokens 1 through N-1. This sequential dependency means the model can never truly parallelize generation — every forward pass produces exactly one new token. On a GPU that could be running thousands of operations simultaneously, the typical local inference loop wastes the vast majority of the hardware's capability. The GPU sits idle between token steps. Memory bandwidth gets hit once per token even though the model weights don't change. The bottleneck isn't compute — it's the sequential architecture.

  • Autoregressive models generate one token per forward pass, forcing sequential execution
  • GPU memory bandwidth is re-paid on every step even though model weights are static
  • The KV cache grows with each token, increasing memory traffic over long sequences
  • Speculative decoding helps, but the draft model is also usually autoregressive — it speeds up verification but not drafting
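For contrast with what follows, here is what that sequential loop looks like as a minimal greedy-decoding sketch. It assumes a model callable that returns per-position logits; production inference code also reuses a KV cache instead of re-encoding the prefix on every step.

import torch

def greedy_decode(model, prompt_ids, max_new_tokens=64, eos_id=None):
    # One full forward pass per generated token: the defining cost of autoregressive decoding.
    # `model(ids)` is assumed to return logits of shape (len(ids), vocab_size).
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(torch.tensor(ids))
        next_id = int(logits[-1].argmax())   # only the final position's prediction is used
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids

Every iteration pays for a full forward pass and produces exactly one token; that is the ceiling the rest of this post is about raising.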

What Is Block Diffusion (BD3-LM)?

Block Diffusion, published by the Kuleshov Group and presented as an Oral at ICLR 2025, takes a fundamentally different approach. Instead of generating tokens one by one, it divides the output sequence into fixed-size blocks and generates all tokens within a block simultaneously using a diffusion process. The full name — Block Discrete Denoising Diffusion Language Model (BD3-LM) — describes exactly what it does: apply discrete (token-space, not continuous) denoising diffusion to blocks of text.

The key insight is that pure diffusion language models (which generate the entire sequence at once through iterative denoising) struggle with arbitrary-length outputs and don't benefit from KV caching. Pure autoregressive models have KV caching and arbitrary length but are sequential. Block Diffusion gets both by operating autoregressively across blocks while running diffusion in parallel within each block.

Generation Paradigm Comparison

  • Autoregressive (GPT, Llama): 1 token generated per forward pass
  • Block Diffusion (block size 4): 4 tokens generated in parallel
  • Block Diffusion (block size 8): 8 tokens generated in parallel
  • Block Diffusion (block size 16): 16 tokens generated in parallel
  • Full Diffusion LM: 32 tokens generated in parallel (the whole sequence at once)

How BD3-LM Actually Works: Two Nested Loops

Block Diffusion runs two loops at inference time. The outer loop is autoregressive: it generates the sequence block by block, where each block is conditioned on all previous blocks via KV cache — exactly like standard AR inference but at block granularity. The inner loop, which runs inside each block, is a diffusion process: it starts with all tokens in the block masked (unknown) and iteratively denoises them in parallel over a fixed number of steps until they're fully resolved.

  1. Start: all tokens in the current block are masked (set to [MASK])
  2. Inner diffusion step: the model predicts all masked tokens simultaneously in one forward pass
  3. Denoising: the highest-confidence predictions are unmasked; lower-confidence ones remain masked
  4. Repeat the inner step for a set number of denoising iterations until all tokens in the block are resolved
  5. Outer step: the completed block is added to the KV cache; move to the next block
  6. Repeat until the sequence is complete (a code sketch of both loops follows)
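The two loops are easiest to see in code. The sketch below is illustrative, not the BD3-LM reference implementation: it assumes a model callable that returns per-position logits, uses a placeholder mask-token id, and unmasks a growing fraction of the block by confidence at each denoising step. A real implementation would also maintain a KV cache across blocks.

import torch

MASK_ID = 0  # placeholder mask-token id; a real setup uses the tokenizer's mask token

def block_diffusion_generate(model, prompt_ids, num_blocks=4, block_size=8, denoise_steps=4):
    # Outer loop: autoregressive over blocks. Inner loop: parallel denoising within a block.
    # `model(ids)` is assumed to return logits of shape (len(ids), vocab_size).
    sequence = list(prompt_ids)
    for _ in range(num_blocks):
        block = [MASK_ID] * block_size                      # start with the block fully masked
        for step in range(denoise_steps):
            logits = model(torch.tensor(sequence + block))[-block_size:]
            probs = torch.softmax(logits, dim=-1)
            conf, pred = probs.max(dim=-1)                   # per-position confidence and argmax
            keep = block_size * (step + 1) // denoise_steps  # unmask schedule: more positions each step
            for i in conf.argsort(descending=True)[:keep].tolist():
                if block[i] == MASK_ID:                      # only fill still-masked positions
                    block[i] = int(pred[i])
        sequence += block                                    # outer step: commit the block to the context
    return sequence

if __name__ == "__main__":
    toy_model = lambda ids: torch.randn(len(ids), 100)       # random-logits stand-in for a real model
    print(block_diffusion_generate(toy_model, [1, 2, 3], num_blocks=2))

Note that the inner loop costs denoise_steps forward passes no matter how wide the block is; that independence from block size is where the per-token savings come from.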

Block size is the key tuning parameter. Larger blocks mean more parallelism within a block but potentially lower quality per block. Block size 8 is a strong default: the model resolves 8 tokens for the cost of a fixed number of denoising passes rather than 8 separate forward passes, while maintaining near-AR quality. Researchers can tune block size to trade off speed against quality for their specific use case.
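As a rough back-of-the-envelope (the 4-step denoising schedule below is an assumption for illustration, not a number from the paper), the forward-pass savings depend on the ratio of block size to denoising steps:

# Illustrative forward-pass accounting for a 256-token completion (hypothetical step count)
seq_len = 256
denoise_steps = 4                                  # assumed inner-loop steps per block
for block_size in (1, 4, 8, 16):
    passes = seq_len if block_size == 1 else (seq_len // block_size) * denoise_steps
    print(f"block size {block_size:>2}: {passes} forward passes, {seq_len / passes:.1f} tokens per pass")

The gain scales with block size divided by denoising steps, which is exactly why both knobs are worth tuning per workload.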

What Is DFlash? Block Diffusion Meets Speculative Decoding

DFlash, published in February 2026 by Jian Chen, Yesheng Liang, and Zhijian Liu at Z Lab, takes Block Diffusion and deploys it as the draft model in a speculative decoding pipeline. Speculative decoding is an already-proven technique: a small, fast "draft" model proposes several tokens ahead, and the large target model verifies them all in a single forward pass. If they match, you get multiple tokens for the cost of one large-model forward pass.

The problem with existing speculative decoding (EAGLE-3, Medusa, etc.) is that the draft model is still autoregressive — it generates candidate tokens one at a time before handing off to the target model. DFlash replaces the autoregressive draft model with a Block Diffusion drafter. Because the drafter generates all candidate tokens in parallel in a single pass, the drafting cost is essentially flat regardless of how many tokens you draft. You can draft 16 or even 32 tokens for nearly the same cost as drafting 1.
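Here is a minimal sketch of one such step, assuming a drafter callable that proposes n tokens in a single parallel pass and a target callable that returns per-position logits. It uses greedy acceptance only; the lossless guarantee for sampled outputs requires the standard rejection-sampling verification rule, and none of this is DFlash's actual implementation.

import torch

def speculative_step(target, drafter, prefix, draft_len=16):
    # One greedy speculative-decoding step with a parallel drafter.
    # Assumptions: `drafter(prefix, n)` proposes n tokens in one pass (the block-diffusion
    # property), and `target(ids)` returns logits of shape (len(ids), vocab_size).
    draft = drafter(prefix, draft_len)                     # all draft tokens in one drafter pass
    logits = target(torch.tensor(prefix + draft))          # one target pass verifies every draft token
    target_pred = logits[len(prefix) - 1:-1].argmax(dim=-1).tolist()
    accepted = []
    for proposed, expected in zip(draft, target_pred):
        if proposed != expected:
            accepted.append(expected)                      # first mismatch: take the target's token, stop
            break
        accepted.append(proposed)
    else:
        accepted.append(int(logits[-1].argmax()))          # every draft token matched: free bonus token
    return prefix + accepted

Because the drafter runs once regardless of draft_len, proposing 16 tokens costs roughly the same as proposing 4; what changes is how many of them the target ends up accepting.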

Speedup vs. Standard Autoregressive Inference

  • Standard AR inference (baseline): 1x
  • EAGLE-3 speculative decoding: 2.8x
  • DFlash (Qwen3.6-27B, code tasks): 5.2x
  • DFlash (math tasks, peak): 6.1x
  • DFlash + DDTree (research config): 8x

Real Benchmark Numbers

The DFlash paper reports lossless acceleration — meaning the output distribution is mathematically identical to the target model's output, not an approximation. There's no quality degradation, only speed gains. Here are the concrete numbers from published benchmarks:

  • Qwen3.6-27B (FP8) with 6 speculative tokens: 52.1 tok/s on long code, 54.2 tok/s on math (vs ~9 tok/s standard AR)
  • Qwen3.6-35B-A3B with DDTree budget=22: 48.5 tok/s — a 5.4x improvement
  • Google TPU v5p: average 3.13x speedup across tasks, peak 6x on complex math
  • vs EAGLE-3: DFlash delivers 2.5x more speedup — 2.29x end-to-end serving vs EAGLE-3's 1.30x on TPU v5p
  • GLM-5 on SWE-bench coding: up to 40% latency reduction vs standard inference

DDTree is a companion technique from the same team that adds a draft tree structure on top of DFlash's block diffusion drafting. Instead of a single linear sequence of draft tokens, DDTree branches multiple possible continuations in a tree. The target model verifies all branches simultaneously, selecting the best path. Combined, DFlash + DDTree can approach 8x speedup on favorable tasks.
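As a rough illustration of the idea (not the published DDTree algorithm): flatten the tree into candidate paths, score each against the target's greedy predictions, and keep the longest fully accepted branch.

import torch

def verify_tree_paths(target, prefix, draft_paths):
    # `draft_paths`: candidate continuations branching from the same prefix (the flattened tree).
    # In practice all branches are verified in one batched pass using a tree attention mask;
    # this sketch verifies each path separately to stay simple.
    best = []
    for path in draft_paths:
        logits = target(torch.tensor(prefix + path))
        preds = logits[len(prefix) - 1:-1].argmax(dim=-1).tolist()  # target's pick at each draft slot
        accepted = []
        for proposed, expected in zip(path, preds):
            if proposed != expected:
                break
            accepted.append(proposed)
        if len(accepted) > len(best):
            best = accepted
    return prefix + best                                            # longest fully accepted branch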

Why This Matters for Local AI Specifically

Both techniques land differently for local inference than for cloud serving. Cloud providers care about throughput — how many requests per second a cluster handles. Local users care about latency — how fast tokens appear on screen for a single conversation. Block Diffusion's parallel generation collapses the number of model forward passes needed per token, which directly cuts latency. DFlash's speculative decoding dramatically increases effective tokens per second for the kinds of generation local users actually do: code, long explanations, and document writing.

  • Higher tok/s means better real-time responsiveness — words appear as fast as you can read them
  • Block Diffusion draft models are smaller and train faster — the DFlash drafter for Qwen3.6-27B is a fraction of the target size
  • Integration with SGLang and vLLM is already underway — Ollama support is being discussed in the llama.cpp community
  • Apple Silicon port exists (dflash-mlx) — runs on M-series Macs without a discrete GPU
  • Lossless guarantee means you get the exact same quality as running the full model without speculative decoding

When Can You Actually Use This?

As of May 2026, DFlash is available as a research release from Z Lab's GitHub. The SGLang integration is in active development, which means production-grade support for local inference via Ollama or similar tools is likely months away, not years. The MLX port for Apple Silicon already exists and can be used today by technically inclined users.

  • Available now (research): Z Lab GitHub repo — requires manual setup, not plug-and-play
  • Available now (Apple Silicon): dflash-mlx repo for M1/M2/M3/M4 Macs via MLX
  • Coming soon: SGLang integration — track the SGLang GitHub for DFlash PR status
  • Likely 2026 H2: vLLM and Ollama integration, making it accessible without manual setup
  • Block Diffusion (BD3-LM) models: kuleshov-group/bd3lms on GitHub and Hugging Face
Terminal (bash)
# Block Diffusion (BD3-LM) — research code
git clone https://github.com/kuleshov-group/bd3lms
cd bd3lms && pip install -e .

# DFlash — Z Lab research release
git clone https://github.com/z-lab/dflash
cd dflash && pip install -e .

# DFlash on Apple Silicon (MLX)
pip install mlx mlx-lm
git clone https://github.com/Aryagm/dflash-mlx
cd dflash-mlx && python run_dflash.py --model <your-model>

The Bigger Picture: Diffusion Is Coming to LLMs

Block Diffusion and DFlash are not isolated papers — they're part of a broader shift in how researchers think about text generation. Diffusion models dominated image and audio generation because they decoupled the "what" from the "when" of generation: all parts of the output could influence each other during the denoising process rather than being locked to a left-to-right dependency chain. Block Diffusion brings that same decoupling to language, and DFlash shows that even traditional autoregressive models benefit from diffusion-based thinking in their draft pipelines.

The trajectory is clear: the 6x speedup DFlash demonstrates today is a floor, not a ceiling. DDTree pushes it toward 8x. Better draft model training will push acceptance rates higher. Hardware co-design (Blackwell's MXFP4 native support accelerates diffusion denoising steps) will push it further. If you're building local AI tooling or picking hardware for a home server today, these are the efficiency gains that will matter over the next 12-18 months.

Check how much faster your current GPU can run your models — including projected gains from speculative decoding — on the Runyard Model Radar.

Open the Runyard Model Radar →
