TurboQuant: First Open-Source Implementation

First open-source implementation of TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh, Daliri, Hadian, Mirrokni — Google Research / Google DeepMind / NYU, April 2025).

TurboQuant compresses LLM KV caches 4-7x at inference time using random rotation + optimal scalar quantization, with near-zero quality loss. No training, no calibration data, fully data-oblivious. Drop-in replacement for HuggingFace Transformers cache.

Key Results

Benchmarked across 5 model families, 6 models (7B to 70B) on NVIDIA H100 NVL (96GB):

Model          Architecture  KV Heads  head_dim  Outlier Layers  Prefill Fidelity  Saved @8K
Qwen2.5-7B     28L, qwen2     4        128       layers 0, 27    exact               380 MB
Llama-3.1-8B   32L, llama     8        128       none            exact               890 MB
Gemma-2-9B     42L, gemma2    8        256       none            exact             2,323 MB
Phi-4-14B      40L, phi3     10        128       none            exact             1,392 MB
Qwen2.5-32B    64L, qwen2     8        128       none            exact             1,791 MB
Llama-3.3-70B  80L, llama     8        128       none            exact               501 MB (@2K)

Prefill logits are bit-identical (0.0 difference) across all six tested models. Decoded outputs remain coherent and semantically correct; any divergence from the uncompressed output is greedy-decoding drift, not quality degradation.

Needle-in-a-Haystack: 100% Recall

Tested on Qwen2.5-7B across 5 context lengths (1K-16K) and 3 needle positions (25%, 50%, 75%):

        Default Cache  TurboQuant Cache
Recall  15/15 (100%)   15/15 (100%)

TurboQuant preserves retrieval quality perfectly, matching the paper's 0.997 recall claim.

Memory Savings Scale with Context

Qwen2.5-32B (4-bit weights) on H100:

Context     Default KV  TurboQuant KV  Saved
1K tokens   19.97 GB    19.79 GB         186 MB
4K tokens   21.23 GB    20.42 GB         833 MB
8K tokens   23.16 GB    21.41 GB       1,791 MB
32K tokens  ~27.5 GB    ~21.8 GB       ~5,700 MB (projected)

Quickstart

from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

# Auto-detect outlier layers, create compressed cache
skip = TurboQuantCache.calibrate_skip_layers(model, tokenizer)
cache = TurboQuantCache(model.config, nbits=4, skip_layers=skip)

# Use exactly like default cache
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100, past_key_values=cache)

How It Works

TurboQuant implements Algorithm 1 (TurboQuant_mse) from the paper:

  1. Random rotation (QR decomposition): transforms each KV vector so coordinates follow a known Beta distribution
  2. Optimal scalar quantization (Lloyd-Max): quantizes each coordinate to 4 bits using precomputed codebook
  3. Bit packing: stores 128-dim vectors as 64 bytes (uint4) + 2 bytes (norm) = 66 bytes vs 256 bytes BF16
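
The three steps can be sketched in a few lines of NumPy. This is an illustrative toy, not the package's code: the uniform codebook stands in for the Lloyd-Max centroids, and all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# Step 1: random rotation. QR of a Gaussian matrix yields a uniformly
# random orthonormal Q, so rotated unit vectors have Beta-law coordinates.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Step 2: scalar quantization. A real implementation uses Lloyd-Max
# centroids for the coordinate distribution; uniform levels shown here.
codebook = np.linspace(-3 / np.sqrt(d), 3 / np.sqrt(d), 16)

def quantize(v):
    """Rotate, normalize, and map each coordinate to its nearest codebook index."""
    norm = np.linalg.norm(v)
    r = (Q @ v) / norm
    idx = np.abs(r[:, None] - codebook[None, :]).argmin(axis=1).astype(np.uint8)
    return idx, norm

# Step 3: bit packing. Two 4-bit indices per byte: 128 dims -> 64 bytes.
def pack_uint4(idx):
    return (idx[0::2] << 4) | idx[1::2]

def unpack_uint4(packed):
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2] = packed >> 4
    out[1::2] = packed & 0x0F
    return out

v = rng.standard_normal(d)
idx, norm = quantize(v)
packed = pack_uint4(idx)                               # 64 bytes (+ stored norm)
v_hat = norm * (Q.T @ codebook[unpack_uint4(packed)])  # dequantize
```

Storing `packed` plus a 2-byte norm gives the 66-byte layout described above; dequantization inverts the rotation with `Q.T` since Q is orthonormal.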

Theoretical guarantee: MSE distortion ≤ 0.009 at 4-bit, within 2.7x of information-theoretic optimum (Shannon lower bound).

Our measured MSE: 0.0093 — matches the paper.

What We Found Beyond the Paper

Outlier Layer Norms

The paper mentions "splitting channels into outlier and non-outlier sets" without specifying how. We discovered:

  • Qwen2.5-7B: Layer 0 key norms = 273.8 (16.2x median). Layer 27 = outlier too.
  • Qwen2.5-32B: Layer 0 = 37.8 (2.35x median). Mild, no skip needed.
  • Llama-3.1-8B: Max/median ratio = 1.18x. No outliers at all.
  • Gemma-2-9B: Max/median ratio = 1.19x. No outliers.
  • Phi-4-14B: Max/median ratio = 1.38x. No outliers.

Finding: Smaller Qwen models have severe outlier layers. Larger models and non-Qwen architectures are well-balanced. Our calibrate_skip_layers() auto-detects outliers and keeps them in full precision.
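
The detection itself can be as simple as a norm-ratio test against the cross-layer median. A sketch of the idea (function name and threshold are illustrative, not the actual calibrate_skip_layers() signature):

```python
from statistics import median

def detect_outlier_layers(key_norms, ratio_threshold=2.0):
    """Flag layers whose mean key norm far exceeds the cross-layer median.

    key_norms: one mean L2 key-vector norm per layer, gathered from a
    short calibration forward pass.
    """
    m = median(key_norms)
    return [i for i, n in enumerate(key_norms) if n / m > ratio_threshold]

# Qwen2.5-7B-like profile: layer 0 at ~16x the median norm, layer 27 elevated.
norms = [273.8] + [16.9] * 26 + [60.0]
print(detect_outlier_layers(norms))  # flags layers 0 and 27
```

Flagged layers are kept in full precision, which costs a small fixed amount of memory but avoids quantizing the layers where a shared codebook fits worst.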

head_dim Compatibility

The paper only tested head_dim=128 (Llama, Mistral). We verified TurboQuant works with head_dim=256 (Gemma-2) — the Lloyd-Max codebook adapts to any dimension since it's computed from the Beta distribution parameterized by d.
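
The dimension-dependence can be made concrete: one coordinate of a uniformly random unit vector in R^d has density proportional to (1 - x^2)^((d-3)/2), and a grid-based Lloyd-Max iteration over that density yields the 2^b centroids for any head_dim. A sketch under those assumptions (not the package's scipy-based solver):

```python
import numpy as np

def lloyd_max_codebook(d, nbits=4, iters=100):
    """Lloyd-Max centroids for one coordinate of a random unit vector in R^d."""
    x = np.linspace(-1 + 1e-9, 1 - 1e-9, 50_001)
    p = (1 - x**2) ** ((d - 3) / 2)          # Beta-law coordinate density
    p /= p.sum()
    c = np.linspace(-2 / np.sqrt(d), 2 / np.sqrt(d), 2**nbits)  # initial levels
    for _ in range(iters):
        b = (c[:-1] + c[1:]) / 2             # decision boundaries = midpoints
        bins = np.searchsorted(b, x)
        c = np.array([np.average(x[bins == k], weights=p[bins == k])
                      for k in range(2**nbits)])  # centroids = cell means
    return c, x, p

# The same routine serves head_dim=128 and head_dim=256: only d changes.
c128, x, p = lloyd_max_codebook(128)
bins = np.searchsorted((c128[:-1] + c128[1:]) / 2, x)
per_coord_mse = np.sum(p * (x - c128[bins]) ** 2)
print(128 * per_coord_mse)  # total unit-vector distortion, roughly 0.01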

Architecture Coverage

Architecture  Paper Tested  We Tested              Works
Llama         Llama-3.1-8B  Llama-3.1-8B, 3.3-70B  Yes
Mistral       Ministral-7B  not tested             -
Qwen          -             Qwen2.5-7B, 32B        Yes (with outlier handling)
Gemma         -             Gemma-2-9B             Yes (head_dim=256)
Phi           -             Phi-4-14B              Yes

Files

turboquant/
├── __init__.py          # Public API
├── codebook.py          # Lloyd-Max solver for Beta distribution
├── quantizer.py         # Core TurboQuantizer: rotate → quantize → pack
├── packing.py           # uint4/uint2 bit packing
└── cache.py             # TurboQuantCache for HF Transformers
scripts/
├── verify.py            # Unit tests (MSE bounds, packing, fixed-point)
├── test_cache.py        # Cache API integration tests
├── benchmark_models.py  # Multi-model benchmark suite
└── run_inference.py     # Interactive inference demo
benchmark_results.json   # Raw benchmark data (all benchmarked models)

Verified Against Paper

Metric                          Paper              Ours
MSE at 4-bit (unit vectors)     ≤ 0.009            0.0093
MSE at 2-bit (unit vectors)     ≤ 0.117            0.116
Compression ratio (per-vector)  ~4x                3.88x
System compression @8K+         4-7x               7.2x
Prefill fidelity                "quality neutral"  exact (0.0 logit diff)
Double quantization             fixed point        verified (identical indices)
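
The double-quantization row reflects an idempotence property that is easy to property-test: re-quantizing a dequantized vector must reproduce identical indices, because each codebook entry is its own nearest neighbor. A toy check (illustrative codebook, not the package API):

```python
import numpy as np

rng = np.random.default_rng(1)
codebook = np.linspace(-1.0, 1.0, 16)   # toy 4-bit codebook

def quantize(v):
    """Index of the nearest codebook entry for each coordinate."""
    return np.abs(v[:, None] - codebook[None, :]).argmin(axis=1)

v = rng.uniform(-1, 1, size=128)
idx1 = quantize(v)
idx2 = quantize(codebook[idx1])          # quantize the dequantized vector
assert np.array_equal(idx1, idx2)        # fixed point: indices unchanged
```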

Requirements

  • Python 3.10+
  • PyTorch 2.7+ (CUDA 12.8 compatible)
  • HuggingFace Transformers 5.0+
  • scipy (for codebook computation)
  • bitsandbytes (optional, for 4-bit model loading)

Citation

If you use this implementation, please cite the original paper:

@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}

License

This implementation is released under MIT License. The TurboQuant algorithm is described in the paper above.
