Engineering · May 30, 2026 · 9 min read

TurboQuant comes to llama.cpp: 2-bit and 3-bit KV cache compression

We forked llama.cpp to bring Google Research's TurboQuant to the KV cache — shrinking it ~5× at 3-bit while staying quality-neutral, with measured wins on Qwen.

Long context is a memory problem before it is a compute problem. Once a model is loaded, the thing that grows with every token you feed it — and every token it generates — is the KV cache: the stored keys and values for attention. At 32K tokens it can rival the weights themselves; at 128K it dwarfs them. On consumer GPUs that ceiling is the difference between running a useful context window and getting an out-of-memory error.

So we forked llama.cpp and added a new way to compress that cache, based on TurboQuant, a 2025 quantization method from Google Research. Our fork lives at github.com/openalchemy/llama.cpp and ships two new cache types — turbo3 (3 bits per value) and turbo2 (2 bits per value) — that drop straight into the existing -ctk / -ctv flags. This post is how it works and what it actually buys you.

Why the KV cache is so expensive

Every attention layer caches a key vector and a value vector for each token. The total size scales with layers × KV heads × head_dim × context × 2 (K and V) × bytes-per-value, where "KV heads" is the number of key/value heads (smaller than the number of query heads on grouped-query models). The only term you can move without touching the model is the last one. Stored at FP16, that is 2 bytes per value. Standard llama.cpp already lets you drop to q8_0 (~1 byte) or q4_0 (~0.5 bytes), but naïve low-bit quantization of the cache degrades quality quickly, because KV vectors have heavy outliers and a uniform grid spends most of its resolution covering them.
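To make the scaling concrete, here is a minimal sketch of that size formula. The model shape below is hypothetical and chosen only for illustration; the bytes-per-value figures are the rounded ones quoted in this post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value):
    """KV cache size in bytes: one K and one V vector per layer, KV head and token."""
    n_values = n_layers * n_kv_heads * head_dim * n_ctx * 2  # x2 for K and V
    return n_values * bytes_per_value


# Hypothetical model shape, for illustration only (not a measured configuration).
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=32768)

# Bytes per value as quoted in this post (rounded).
formats = [("fp16", 2.00), ("q8_0", 1.06), ("q4_0", 0.56),
           ("turbo3", 0.39), ("turbo2", 0.27)]

for name, bpv in formats:
    mib = kv_cache_bytes(**shape, bytes_per_value=bpv) / 2**20
    print(f"{name:7s} {mib:8.0f} MiB")
```

Everything except bytes_per_value is fixed by the model and the context length you ask for, which is why the cache format is the only practical lever.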

What TurboQuant does differently

TurboQuant's insight is to rotate before you quantize. Multiply each vector by a random orthogonal matrix and its energy spreads out across all coordinates — the distribution of each coordinate becomes near-identical and outlier-free. After that rotation, a simple per-coordinate scalar quantizer (an optimal Lloyd–Max codebook) is provably near the best you can do for a given bit budget. It is data-oblivious: no calibration set, no per-model tuning, no training. The rotation is undone at read time, so attention sees vectors that round-trip with very little distortion.
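For readers who have not met a Lloyd–Max codebook before, the sketch below fits one with plain Lloyd iterations. It assumes the rotated coordinates are roughly Gaussian and trains on synthetic unit-Gaussian samples; the function name, sample count and iteration count are illustrative choices, not the fork's code.

```python
import random


def lloyd_max(samples, n_levels, iters=50):
    """1-D Lloyd iteration: alternate nearest-level assignment and centroid update."""
    xs = sorted(samples)
    # Initialise the levels at evenly spaced quantiles of the data.
    levels = [xs[(2 * i + 1) * len(xs) // (2 * n_levels)] for i in range(n_levels)]
    for _ in range(iters):
        # Decision boundaries sit midway between neighbouring levels.
        bounds = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
        sums = [0.0] * n_levels
        counts = [0] * n_levels
        k = 0
        for x in xs:  # xs is sorted, so the cell index only ever moves forward
            while k < n_levels - 1 and x > bounds[k]:
                k += 1
            sums[k] += x
            counts[k] += 1
        # Move each level to the mean of its cell (keep it if the cell is empty).
        levels = [s / c if c else lvl for s, c, lvl in zip(sums, counts, levels)]
    return levels


rng = random.Random(0)  # fixed seed so the run is repeatable
samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]
codebook = lloyd_max(samples, n_levels=8)  # 8 levels = 3 bits per coordinate
print([round(c, 3) for c in codebook])
```

Because the distribution after rotation is known in advance, a codebook like this can be computed once offline and hard-coded, which is what makes the scheme data-oblivious.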

The catch is cost: a dense random rotation is a matrix multiply on the hot path. Our engineering change is to replace the dense Haar-random matrix with a Fast Walsh–Hadamard Transform (FWHT), an orthogonal transform built from nothing but additions and subtractions that runs in O(d log d) instead of O(d²). It has the same energy-spreading property in expectation, it is self-inverse up to a constant scale (so decoding is the same kernel run again, with the scale folded into softmax), and its butterfly stages need no floating-point multiplies at all. That is what makes a per-token rotation cheap enough to live inside the cache write path.
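The transform itself is short enough to show in full. This is a pure-Python reference sketch of the textbook unnormalized FWHT, not the CUDA kernel from the fork, and it divides by d explicitly on the way back instead of folding the scale into softmax.

```python
def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    h = 1
    while h < n:  # log2(n) stages
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly: one add, one subtract
        h *= 2
    return out


d = 128
x = [0.0] * d
x[5] = 10.0                      # a single large outlier coordinate
y = fwht(x)                      # every coordinate of y now has magnitude 10
back = [t / d for t in fwht(y)]  # applying the transform twice gives d * x
print(max(abs(t) for t in y), back[5])
```

The one-hot input is the worst case for a uniform grid, and after the transform its energy sits equally in all 128 coordinates, which is the property the quantizer relies on.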

How small does it get?

Each block packs 128 values plus one FP16 norm. That works out to 0.39 bytes per value for turbo3 and 0.27 bytes per value for turbo2 — roughly 5× and 7× smaller than FP16, and smaller than q4_0 while carrying a rotation that preserves quality far better than a raw 3- or 2-bit grid would.

Figure: bytes stored per KV value at head_dim 128, including the per-block FP16 norm (lower is smaller) · fp16 2.00 B · q8_0 1.06 B · q4_0 0.56 B · turbo3 0.39 B · turbo2 0.27 B
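The per-value figures follow directly from the block layout: 128 indices of 3 (or 2) bits plus a 2-byte norm. The sketch below packs one such block to show the arithmetic. The byte order and field order are assumptions made for illustration and may not match the fork's on-device layout.

```python
import struct

BLOCK = 128  # values per block, as described in this post


def pack_block(indices, norm, bits):
    """Pack BLOCK codebook indices of `bits` bits each, then one FP16 norm."""
    assert len(indices) == BLOCK
    assert all(0 <= i < (1 << bits) for i in indices)
    word = 0
    for pos, idx in enumerate(indices):
        word |= idx << (pos * bits)  # little-endian bit stream (assumed layout)
    payload = word.to_bytes(BLOCK * bits // 8, "little")
    return payload + struct.pack("<e", norm)  # "<e" is IEEE 754 half precision


def unpack_block(blob, bits):
    """Inverse of pack_block: returns (indices, norm)."""
    word = int.from_bytes(blob[:-2], "little")
    mask = (1 << bits) - 1
    indices = [(word >> (pos * bits)) & mask for pos in range(BLOCK)]
    (norm,) = struct.unpack("<e", blob[-2:])
    return indices, norm


for bits in (3, 2):
    demo = [i % (1 << bits) for i in range(BLOCK)]
    blob = pack_block(demo, 1.5, bits)
    print(bits, len(blob), len(blob) / BLOCK)  # 3 -> 50 bytes, 2 -> 34 bytes
```

Fifty bytes over 128 values is 0.390625 bytes per value and 34 bytes is 0.265625, which round to the 0.39 and 0.27 quoted above.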

The numbers on real models

Per-value ratios are nice; what matters is the VRAM you get back. We measured end-to-end on an RTX 5080 (16 GB) at a 32,768-token context, comparing total VRAM with an FP16 KV cache against turbo3 across two models. On Qwen2.5-Coder-14B (Q4_K_M), turbo3 freed 4,418 MiB (≈4.3 GiB); on Qwen3.5-9B (Q4_K_M, head_dim 256) it freed 828 MiB. That 4.3 GiB is the margin that lets a 14B coding model hold a full repo's worth of context on a card that otherwise can't.

Figure: total VRAM (weights + KV cache) in MiB resident, FP16 KV cache vs turbo3, on RTX 5080 16 GB · 32K context · Q4_K_M weights. The gap within each pair is the KV cache saving.

14B: 15,656 MiB (fp16) vs 11,238 MiB (turbo3), a saving of 4,418 MiB
9B: 7,968 MiB (fp16) vs 7,140 MiB (turbo3), a saving of 828 MiB

Smaller does not mean slower. On the Qwen2.5-Coder-14B run we measured 1.24× faster prompt processing and 1.47× faster generation than the FP16 baseline. The likely reason is memory traffic: a cache several times smaller means fewer bytes read per attention step and more of the cache staying in on-chip memory, so the rotation kernel more than pays for itself.

Figure: throughput relative to the FP16 baseline (1.00×) on Qwen2.5-Coder-14B with turbo3 (higher is faster) · prompt processing 1.24× · generation 1.47×

The technique is not tied to one head shape: the 9B above uses a head_dim of 256 (two stacked 128-blocks), while the 14B uses 128. Any model whose per-head dimension is a multiple of 128 is supported, which covers most models in the Qwen 2.5 / 3 / 3.5 families, Llama 3.x, Mistral, Mixtral and Gemma (a few of the smallest variants use a head_dim of 64 and fall outside that rule).

Using it

The easiest path is no code at all: TurboQuant already ships in the OpenAlchemy Engine desktop app (available since v0.3.0, including v0.5.1). Open Settings → Runtime → KV Cache Quantization and set both the key and value cache to TurboQuant 3-bit. There is no rebuild and there are no flags; the setting takes effect the next time a model loads.

Prefer to drive llama.cpp directly? It is two flags. Build the fork, then point the cache types at turbo3 (Flash Attention is required, and the kernels are CUDA today):

./llama-cli -m Qwen2.5-Coder-14B-Instruct-Q4_K_M.gguf \
  -ngl 99 -fa 1 -c 32768 \
  -ctk turbo3 -ctv turbo3 \
  -p "def fibonacci(n):" -n 64

turbo2 is wired up the same way and is the right tool when you are squeezing the last gigabyte for an extreme context window — we ship it as experimental / opt-in because at 2 bits the quality trade-off is real and model-dependent. turbo3 is the one we reach for by default: in our testing it is effectively quality-neutral for long-context work.

What's shipped, and what's next

Today the path is end-to-end on GPU: the cache is stored compressed and dequantized to FP16 just before the attention kernel. Next on the roadmap is fusing dequantization inside the Flash-Attention tile so the keys never materialize in FP16 at all, plus backends beyond CUDA. The CPU reference path exists for correctness, not speed.
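As a mental model of that write and read path, here is a CPU-only sketch for a single 128-value block: normalize, rotate, snap each coordinate to a codebook level on write, then look up, rotate back and rescale on read. The codebook is the classic 8-level Lloyd–Max table for a unit Gaussian (values approximate), used here as an assumption; the function names are hypothetical and the fork's kernels may differ in every detail.

```python
import math
import random

# Approximate 8-level Lloyd-Max levels for a unit Gaussian (assumed 3-bit codebook).
CODEBOOK = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]


def fwht(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def write_block(x):
    """Cache write: store one norm and one 3-bit codebook index per coordinate."""
    norm = math.sqrt(sum(t * t for t in x))
    unit = [t / norm for t in x] if norm else list(x)
    rotated = fwht(unit)  # unnormalized, so coordinates have mean square 1
    idx = [min(range(len(CODEBOOK)), key=lambda k: abs(CODEBOOK[k] - r))
           for r in rotated]
    return idx, norm


def read_block(idx, norm):
    """Cache read: look up levels, undo the rotation, restore the norm."""
    d = len(idx)
    levels = [CODEBOOK[k] for k in idx]
    return [norm * t / d for t in fwht(levels)]  # fwht twice equals d times identity


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(128)]
idx, norm = write_block(x)
x_hat = read_block(idx, norm)
rel_err = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))) / norm
print(f"relative round-trip error: {rel_err:.3f}")
```

The planned fusion would keep the same arithmetic but perform the lookup and inverse rotation inside the attention tile, so the reconstructed block never has to be written out at full precision.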

The fork is open source. Read the code, the kernels and the test harness at github.com/openalchemy/llama.cpp — and if you just want the result without building anything, it is part of the engine behind the OpenAlchemy inference platform.