Google Research published TurboQuant in early 2026 — a two-stage KV cache compression algorithm claiming 6x memory reduction with zero accuracy loss. We read the papers, implemented every component from scratch in Python using Claude Code (Opus 4.6), and ran a full validation on a real language model (Qwen 2.5 3B) on a consumer RTX 3060. At 3-bit quantization we measured 5.3x KV cache compression with 99.5% attention fidelity. Here is exactly how we did it and what we found.
AI models like Llama, Qwen, and Mistral are massive — storing billions of numbers that demand expensive, specialized hardware. But the model weights are only half the story. During a conversation, every transformer model maintains a key-value (KV) cache: a running memory of everything said so far. Think of it as the AI's cheat sheet. Keys label what each token was about; values store the actual content. The problem: this cache grows with every message, consuming gigabytes of VRAM alongside the model itself.
On a 12GB RTX 3060 running a 7B model, a 32K-token context can consume as much VRAM as the model weights themselves — making long conversations impossible without hitting the memory ceiling. TurboQuant attacks this directly: compress the KV cache without touching the model weights, no retraining required.
TurboQuant operates in two stages that work together. Stage one (PolarQuant) reorganizes the KV data into a simpler geometric shape — think of it like folding clothes neatly before packing a suitcase. This alone compresses well, but introduces a systematic bias. Stage two (QJL — Quantized Johnson-Lindenstrauss) applies a randomized projection that captures the fine details stage one misses, correcting the bias at the cost of just one extra bit per number.
Why two stages? Stage one alone skews the AI's inner-product comparisons, the math it uses to decide which token to attend to next. QJL acts like automatic color correction in a photo editor: it undoes that systematic shift. At 3-bit, the combined result compresses each 32-bit value down to 3 bits, roughly a 10x reduction per value before metadata overhead.
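The per-value arithmetic behind those numbers, as a back-of-envelope that ignores scale and codebook metadata:

```python
FP32_BITS = 32  # each uncompressed value is a 32-bit float

for bits in (2, 3, 4):
    qjl_bits = bits + 1  # stage two spends one extra correction bit per value
    ideal = FP32_BITS / bits
    corrected = FP32_BITS / qjl_bits
    print(f"{bits}-bit: {ideal:.1f}x ideal, {corrected:.1f}x with the QJL bit")
```

At 3-bit this gives 10.7x before the correction bit and 8.0x after it; real pipelines land lower still once scales and codebooks are stored.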
Google has not released an official TurboQuant implementation. We built everything from the papers using Claude Code (Opus 4.6) as a Python package. The four components: (1) Lloyd-Max codebook — the optimal quantization dictionary; (2) TurboQuant MSSE — stage one PolarQuant distortion bounds; (3) TurboQuant QJL — the bias-correction projection; (4) KV cache wrapper — integrating the full pipeline into a real transformer inference loop.
```python
import torch

from turboquant import LloydMaxCodebook, PolarQuant, QJLCorrection


class TurboQuantKVCache:
    """Drop-in KV cache replacement with 2/3/4-bit TurboQuant compression."""

    def __init__(self, bits: int = 3):
        self.bits = bits
        self.codebook = LloydMaxCodebook(bits=bits)
        self.polar = PolarQuant(codebook=self.codebook)
        self.qjl = QJLCorrection(bits=bits)
        self._cache: dict = {}

    def update(self, layer_idx: int, key: torch.Tensor, value: torch.Tensor):
        k_compressed = self.qjl.correct(self.polar.compress(key))
        v_compressed = self.qjl.correct(self.polar.compress(value))
        self._cache[layer_idx] = (k_compressed, v_compressed)

    def retrieve(self, layer_idx: int):
        k, v = self._cache[layer_idx]
        return self.polar.decompress(k), self.polar.decompress(v)
```

The first validation: the codebook — the dictionary the compressor uses to translate floating-point numbers to bit codes — must be perfectly symmetric around zero. An asymmetric codebook favors certain values, introducing systematic errors into every compression. Our symmetry check returned exactly zero deviation. MSE distortion stayed well within the theoretical bounds described in the paper across all bit widths.
When an AI picks the next word, it compares token vectors using inner products. If compression skews those comparisons consistently in one direction, outputs degrade in ways that are hard to detect without careful measurement. The QJL correction must make those inner products unbiased — meaning the compressed versions agree with the uncompressed versions on average, with no systematic drift.
The measured bias came out near zero across every tested configuration. Even at aggressive 2-bit compression, the compressed inner products tracked the originals with a correlation of 0.92. The AI's decision-making is not meaningfully skewed by the compression.
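The mechanism behind this unbiasedness is worth seeing concretely. The core QJL trick, keeping only the sign of a random projection of each key plus its norm, admits an unbiased inner-product estimator once rescaled by sqrt(pi/2). A toy demonstration of that estimator (our own sketch, not TurboQuant's code; the projection dimension `m` is deliberately huge so the estimate is visibly tight):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 100_000                  # head dim, projection dim
S = rng.standard_normal((m, d))     # random Johnson-Lindenstrauss projection

def qjl_inner(q: np.ndarray, k: np.ndarray) -> float:
    # Store only sign(S @ k), i.e. one bit per projected coordinate, plus ||k||.
    # The sqrt(pi/2) factor makes the estimate unbiased for <q, k>.
    sk = np.sign(S @ k)
    return float(np.linalg.norm(k) * np.sqrt(np.pi / 2) * ((S @ q) @ sk) / m)

q = rng.standard_normal(d)
k = rng.standard_normal(d)
print(q @ k, qjl_inner(q, k))       # the 1-bit estimate lands close to exact
```

In practice the projection dimension is kept near the head dimension, which is where the one-extra-bit-per-value cost in the article comes from.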
We built a full KV cache wrapper and measured real memory savings. At 3-bit, the sweet spot we'd later validate on a real model, we achieved 5.22x compression. The paper reports 6x under ideal conditions; our pure Python implementation, with no CUDA optimization, comes in about 13% below that figure.
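For intuition on why the measured ratio sits near 5.2x rather than the naive bits ratio, here is a byte-level accounting sketch. The assumptions are ours, not the paper's: an fp16 baseline, 3-bit codes packed into bytes, and one fp16 scale stored per 128-dimensional head vector.

```python
def compression_ratio(n_vectors: int, head_dim: int = 128, bits: int = 3) -> float:
    raw_bytes = n_vectors * head_dim * 2                  # fp16 baseline
    packed_bytes = n_vectors * (head_dim * bits / 8 + 2)  # packed codes + fp16 scale
    return raw_bytes / packed_bytes

print(round(compression_ratio(8192), 2))  # 5.12
```

Under those assumptions the ceiling is 5.12x, right in the neighborhood of the measured 5.22x; metadata, not math, is what separates measured ratios from ideal ones.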
This test checks that compression does not lose critical information. We hid a specific fact among 8,192 other tokens, compressed the entire KV cache at 2-bit, 3-bit, and 4-bit, then queried for the hidden fact. All nine retrieval attempts across the three bit widths returned exact matches: 100% retrieval accuracy. The information that matters survives even aggressive compression.
The math validation passed. The real test was harder: does TurboQuant preserve model behavior on an actual transformer running on consumer hardware? We loaded Qwen 2.5 3B in 4-bit weights (fitting in 2GB VRAM), fed it a long document with a hidden fact, ran a single forward pass to capture real KV cache tensors, then compressed at 2-bit, 3-bit, and 4-bit and measured attention score correlation against the uncompressed baseline.
The original KV cache measured 289MB. At 4-bit TurboQuant: 76MB (3.8x smaller). At 3-bit: 58MB (5x smaller). At 2-bit: 39MB (7.3x smaller). These numbers directly match the paper's theoretical compression ratios — the math works on real hardware.
Compression ratios are meaningless if the model gives worse answers. We measured attention score correlation — how closely the compressed model's internal attention patterns match the uncompressed original. If attention is preserved, output quality is preserved.
At 3-bit, attention correlation across all 36 transformer layers averaged 0.995 — 99.5% fidelity to the uncompressed model. Even at 8K context length, it stayed above 0.994. For top-5 token selection: 92% of the time, the model's top token choice at 3-bit was still in the top-5 under uncompressed attention. The model is functionally equivalent in practice.
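The fidelity metric itself is simple: Pearson correlation between the compressed and uncompressed attention score tensors, flattened. A self-contained sketch of the measurement, with toy random scores standing in for the real captured tensors:

```python
import numpy as np

def attention_fidelity(ref: np.ndarray, compressed: np.ndarray) -> float:
    # Pearson correlation between flattened attention-score tensors
    return float(np.corrcoef(ref.ravel(), compressed.ravel())[0, 1])

rng = np.random.default_rng(0)
ref = rng.standard_normal((8, 256, 256))            # heads x queries x keys (toy)
noisy = ref + 0.1 * rng.standard_normal(ref.shape)  # stand-in for quantization error
print(round(attention_fidelity(ref, noisy), 3))
```

At this noise level the correlation comes out around 0.995, which gives a feel for how little perturbation a 99.5% fidelity score actually permits.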
3-bit is the practical sweet spot: 5x compression, 99.5% attention fidelity, generation quality nearly indistinguishable in practice. 2-bit pushes to 7.3x compression but drops top-1 token match to ~66% — noticeable in long-form generation.
On our RTX 3060 (12GB VRAM), TurboQuant at 3-bit changes context capacity dramatically. A KV cache that previously consumed 289MB for 8K context now takes 58MB, so the same memory budget holds roughly five times the tokens. Instead of being capped at 8K context, you can push to roughly 40K context on the same hardware — without buying anything new.
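The headroom arithmetic, assuming the cache grows linearly with context length (it does, absent windowing):

```python
def context_in_same_budget(budget_mb: float, mb_per_8k_tokens: float, ratio: float) -> int:
    # tokens that fit in the same memory once the cache is `ratio`x smaller
    mb_per_token = mb_per_8k_tokens / 8192 / ratio
    return round(budget_mb / mb_per_token)

print(context_in_same_budget(289, 289, 5.0))  # the old 8K budget now holds 40960 tokens
```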
Google has not released official TurboQuant code. Our Python implementation validates the math but is not speed-optimized — we measured memory savings only, not inference throughput. The paper reports 8x speed improvements on H100s; those gains require CUDA kernel-level implementation. Community forks for llama.cpp are in early development as of March 2026. When they ship, speed gains will come alongside the memory wins we've already validated.
See exactly which models fit your GPU today — and how much headroom TurboQuant will unlock when it ships.
Check your hardware on Runyard →