First open-source implementation of TurboQuant (arXiv 2504.19874) — 4-7x LLM KV cache compression

These details have not been verified by PyPI

Project links

Project description

TurboQuant: First Open-Source Implementation

First open-source implementation of TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh, Daliri, Hadian, Mirrokni — Google Research / Google DeepMind / NYU, April 2025).

TurboQuant compresses LLM KV caches 4-7x at inference time using random rotation + optimal scalar quantization, with near-zero quality loss. No training, no calibration data, fully data-oblivious. Drop-in replacement for HuggingFace Transformers cache.

Key Results

Benchmarked across 5 model families, 6 models (7B to 70B) on NVIDIA H100 NVL (96GB):

Model	Architecture	KV Heads	head_dim	Outlier Layers	Prefill Fidelity	Saved @8K
Qwen2.5-7B	28L, qwen2	4	128	layers 0, 27	exact	380 MB
Llama-3.1-8B	32L, llama	8	128	none	exact	890 MB
Gemma-2-9B	42L, gemma2	8	256	none	exact	2,323 MB
Phi-4-14B	40L, phi3	10	128	none	exact	1,392 MB
Qwen2.5-32B	64L, qwen2	8	128	none	exact	1,791 MB
Llama-3.3-70B	80L, llama	8	128	none	exact	501 MB (@2K)

Prefill logits are bit-identical (0.0 difference) across all 6 tested models. Output quality is coherent and semantically correct — divergence from uncompressed output is purely greedy-decoding drift, not quality degradation.

Needle-in-a-Haystack: 100% Recall

Tested on Qwen2.5-7B across 5 context lengths (1K-16K) and 3 needle positions (25%, 50%, 75%):

	Default Cache	TurboQuant Cache
Recall	15/15 (100%)	15/15 (100%)

TurboQuant preserves retrieval quality perfectly, matching the paper's 0.997 recall claim.

Memory Savings Scale with Context

Qwen2.5-32B (4-bit weights) on H100:

Context	Default KV	TurboQuant KV	Saved
1K tokens	19.97 GB	19.79 GB	186 MB
4K tokens	21.23 GB	20.42 GB	833 MB
8K tokens	23.16 GB	21.41 GB	1,791 MB
32K tokens	~27.5 GB	~21.8 GB	~5,700 MB (projected)

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

# Auto-detect outlier layers, create compressed cache
skip = TurboQuantCache.calibrate_skip_layers(model, tokenizer)
cache = TurboQuantCache(model.config, nbits=4, skip_layers=skip)

# Use exactly like default cache
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100, past_key_values=cache)

How It Works

TurboQuant implements Algorithm 1 (TurboQuant_mse) from the paper:

Random rotation (QR decomposition): transforms each KV vector so coordinates follow a known Beta distribution
Optimal scalar quantization (Lloyd-Max): quantizes each coordinate to 4 bits using precomputed codebook
Bit packing: stores 128-dim vectors as 64 bytes (uint4) + 2 bytes (norm) = 66 bytes vs 256 bytes BF16

Theoretical guarantee: MSE distortion ≤ 0.009 at 4-bit, within 2.7x of information-theoretic optimum (Shannon lower bound).

Our measured MSE: 0.0093 — matches the paper.

What We Found Beyond the Paper

Outlier Layer Norms

The paper mentions "splitting channels into outlier and non-outlier sets" without specifying how. We discovered:

Qwen2.5-7B: Layer 0 key norms = 273.8 (16.2x median). Layer 27 = outlier too.
Qwen2.5-32B: Layer 0 = 37.8 (2.35x median). Mild, no skip needed.
Llama-3.1-8B: Max/median ratio = 1.18x. No outliers at all.
Gemma-2-9B: Max/median ratio = 1.19x. No outliers.
Phi-4-14B: Max/median ratio = 1.38x. No outliers.

Finding: Smaller Qwen models have severe outlier layers. Larger models and non-Qwen architectures are well-balanced. Our calibrate_skip_layers() auto-detects outliers and keeps them in full precision.

head_dim Compatibility

The paper only tested head_dim=128 (Llama, Mistral). We verified TurboQuant works with head_dim=256 (Gemma-2) — the Lloyd-Max codebook adapts to any dimension since it's computed from the Beta distribution parameterized by d.

Architecture Coverage

Architecture	Paper Tested	We Tested	Works
Llama	Llama-3.1-8B	Llama-3.1-8B, 3.3-70B	Yes
Mistral	Ministral-7B	—	—
Qwen	—	Qwen2.5-7B, 32B	Yes (with outlier handling)
Gemma	—	Gemma-2-9B	Yes (head_dim=256)
Phi	—	Phi-4-14B	Yes

Files

turboquant/
├── __init__.py          # Public API
├── codebook.py          # Lloyd-Max solver for Beta distribution
├── quantizer.py         # Core TurboQuantizer: rotate → quantize → pack
├── packing.py           # uint4/uint2 bit packing
├── cache.py             # TurboQuantCache for HF Transformers
scripts/
├── verify.py            # Unit tests (MSE bounds, packing, fixed-point)
├── test_cache.py        # Cache API integration tests
├── benchmark_models.py  # Multi-model benchmark suite
├── run_inference.py     # Interactive inference demo
benchmark_results.json   # Raw benchmark data (all 5 models)

Verified Against Paper

Metric	Paper	Ours
MSE at 4-bit (unit vectors)	≤ 0.009	0.0093
MSE at 2-bit (unit vectors)	≤ 0.117	0.116
Compression ratio (per-vector)	~4x	3.88x
System compression @8K+	4-7x	7.2x
Prefill fidelity	"quality neutral"	exact (0.0 logit diff)
Double quantization	fixed point	verified (indices identical)

Requirements

Python 3.10+
PyTorch 2.7+ (CUDA 12.8 compatible)
HuggingFace Transformers 5.0+
scipy (for codebook computation)
bitsandbytes (optional, for 4-bit model loading)

Citation

If you use this implementation, please cite the original paper:

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}

License

This implementation is released under MIT License. The TurboQuant algorithm is described in the paper above.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

This version

0.1.0

Mar 29, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

turboquant_impl-0.1.0.tar.gz (12.9 kB view details)

Uploaded Mar 29, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

turboquant_impl-0.1.0-py3-none-any.whl (11.2 kB view details)

Uploaded Mar 29, 2026 Python 3

File details

Details for the file turboquant_impl-0.1.0.tar.gz.

File metadata

Download URL: turboquant_impl-0.1.0.tar.gz
Upload date: Mar 29, 2026
Size: 12.9 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turboquant_impl-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`79dc5e4a373ce7bc377d42adff050590f63be2ef6d9c86d13bc6ba7fb3e6ef54`
MD5	`2d9aa43d532b49765bf379c72f852886`
BLAKE2b-256	`7b8dc80b42a11a66c3b2a5e8fd390261dc7252218bd1685113b04b6660e1c41c`

See more details on using hashes here.

File details

Details for the file turboquant_impl-0.1.0-py3-none-any.whl.

File metadata

Download URL: turboquant_impl-0.1.0-py3-none-any.whl
Upload date: Mar 29, 2026
Size: 11.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for turboquant_impl-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d21c1baad48a092fd0c4311804af2ddd84a38bd3e0e4329e16e78b9479367c0e`
MD5	`22a7507a44a099a356e87c572013b25b`
BLAKE2b-256	`9053114c78646068b92e6bb24bc1a82336de1f1c958487049dff5036b32ce758`

See more details on using hashes here.

turboquant-impl 0.1.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

TurboQuant: First Open-Source Implementation

Key Results

Needle-in-a-Haystack: 100% Recall

Memory Savings Scale with Context

Quickstart

How It Works

What We Found Beyond the Paper

Outlier Layer Norms

head_dim Compatibility

Architecture Coverage

Files

Verified Against Paper

Requirements

Citation

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes