TurboQuant: First Open-Source Implementation

First open-source implementation of TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (Zandieh, Daliri, Hadian, Mirrokni — Google Research / Google DeepMind / NYU, April 2025).

TurboQuant compresses LLM KV caches 4-7x at inference time using random rotation + optimal scalar quantization, with near-zero quality loss. No training, no calibration data, fully data-oblivious. Drop-in replacement for HuggingFace Transformers cache.

Key Results

Benchmarked across 5 model families, 6 models (7B to 70B) on NVIDIA H100 NVL (96GB):

| Model | Architecture | KV Heads | head_dim | Outlier Layers | Prefill Fidelity | Saved @8K |
|---|---|---|---|---|---|---|
| Qwen2.5-7B | 28L, qwen2 | 4 | 128 | layers 0, 27 | exact | 380 MB |
| Llama-3.1-8B | 32L, llama | 8 | 128 | none | exact | 890 MB |
| Gemma-2-9B | 42L, gemma2 | 8 | 256 | none | exact | 2,323 MB |
| Phi-4-14B | 40L, phi3 | 10 | 128 | none | exact | 1,392 MB |
| Qwen2.5-32B | 64L, qwen2 | 8 | 128 | none | exact | 1,791 MB |
| Llama-3.3-70B | 80L, llama | 8 | 128 | none | exact | 501 MB (@2K) |

Prefill logits are bit-identical (0.0 difference) across all 6 tested models. Decoded outputs remain coherent and semantically correct; divergence from the uncompressed output is greedy-decoding drift, not quality degradation.

Needle-in-a-Haystack: 100% Recall

Tested on Qwen2.5-7B across 5 context lengths (1K-16K) and 3 needle positions (25%, 50%, 75%):

| | Default Cache | TurboQuant Cache |
|---|---|---|
| Recall | 15/15 (100%) | 15/15 (100%) |

TurboQuant preserves retrieval quality perfectly, consistent with the paper's reported 0.997 recall.

Memory Savings Scale with Context

Qwen2.5-32B (4-bit weights) on H100:

| Context | GPU Memory (Default Cache) | GPU Memory (TurboQuant) | Saved |
|---|---|---|---|
| 1K tokens | 19.97 GB | 19.79 GB | 186 MB |
| 4K tokens | 21.23 GB | 20.42 GB | 833 MB |
| 8K tokens | 23.16 GB | 21.41 GB | 1,791 MB |
| 32K tokens | ~27.5 GB | ~21.8 GB | ~5,700 MB (projected) |

Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-32B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

# Auto-detect outlier layers, create compressed cache
skip = TurboQuantCache.calibrate_skip_layers(model, tokenizer)
cache = TurboQuantCache(model.config, nbits=4, skip_layers=skip)

# Use exactly like the default cache
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100, past_key_values=cache)
```

How It Works

TurboQuant implements Algorithm 1 (TurboQuant_mse) from the paper:

  1. Random rotation (QR decomposition): transforms each KV vector so coordinates follow a known Beta distribution
  2. Optimal scalar quantization (Lloyd-Max): quantizes each coordinate to 4 bits using precomputed codebook
  3. Bit packing: stores 128-dim vectors as 64 bytes (uint4) + 2 bytes (norm) = 66 bytes vs 256 bytes BF16
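The three steps can be sketched end-to-end in NumPy. This is a toy stand-in, not the package's implementation: the real TurboQuant codebook comes from a Lloyd-Max solve over the post-rotation coordinate distribution, whereas the uniform levels below are a placeholder chosen only to make the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

# 1. Random rotation: orthogonal Q from QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Placeholder 4-bit codebook: 16 uniform levels spanning ~4 standard deviations
# of a unit vector's coordinates (std ~ 1/sqrt(d)); TurboQuant uses Lloyd-Max
# centroids instead.
codebook = np.linspace(-4 / np.sqrt(d), 4 / np.sqrt(d), 16)

def compress(v):
    norm = np.linalg.norm(v)
    r = (Q @ v) / norm                                      # 1. rotate onto unit sphere
    idx = np.abs(r[:, None] - codebook).argmin(axis=1)      # 2. nearest codebook level
    packed = ((idx[0::2] << 4) | idx[1::2]).astype(np.uint8)  # 3. two uint4 per byte
    return packed, np.float16(norm)                         # 64 + 2 bytes for d=128

def decompress(packed, norm):
    idx = np.empty(2 * packed.size, dtype=np.uint8)
    idx[0::2], idx[1::2] = packed >> 4, packed & 0x0F       # unpack uint4 pairs
    return float(norm) * (Q.T @ codebook[idx])              # invert rotation, rescale

v = rng.standard_normal(d)
packed, norm = compress(v)
v_hat = decompress(packed, norm)
```

Even with the placeholder codebook, the rotated coordinates are well concentrated, so the round trip recovers the vector closely; swapping in the Lloyd-Max centroids is what brings the distortion down to the paper's bound.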

Theoretical guarantee: MSE distortion ≤ 0.009 at 4-bit, within 2.7x of information-theoretic optimum (Shannon lower bound).

Our measured MSE of 0.0093 matches the paper's bound to within rounding.

What We Found Beyond the Paper

Outlier Layer Norms

The paper mentions "splitting channels into outlier and non-outlier sets" without specifying how. We discovered:

  • Qwen2.5-7B: Layer 0 key norms = 273.8 (16.2x median). Layer 27 = outlier too.
  • Qwen2.5-32B: Layer 0 = 37.8 (2.35x median). Mild, no skip needed.
  • Llama-3.1-8B: Max/median ratio = 1.18x. No outliers at all.
  • Gemma-2-9B: Max/median ratio = 1.19x. No outliers.
  • Phi-4-14B: Max/median ratio = 1.38x. No outliers.

Finding: Smaller Qwen models have severe outlier layers. Larger models and non-Qwen architectures are well-balanced. Our calibrate_skip_layers() auto-detects outliers and keeps them in full precision.
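The detection logic can be sketched as follows. The function name, threshold, and all norm values except layer 0's are illustrative, not the shipped `calibrate_skip_layers()` internals: flag any layer whose mean key norm exceeds a multiple of the median across layers.

```python
import numpy as np

def detect_outlier_layers(layer_key_norms, threshold=2.0):
    # layer_key_norms: mean L2 key norm per layer, gathered from a short
    # calibration forward pass; threshold=2.0 is an illustrative choice
    norms = np.asarray(layer_key_norms, dtype=float)
    median = np.median(norms)
    return [i for i, n in enumerate(norms) if n > threshold * median]

# Qwen2.5-7B-like profile: layer 0 sits at ~16x the median and layer 27 is
# also elevated, so both get flagged and stay in full precision
norms = [273.8] + [16.9] * 26 + [60.0]
skip = detect_outlier_layers(norms)
```

A median-relative threshold is robust here because a single extreme layer (like Qwen's layer 0 at 16.2x) would badly skew a mean-relative one.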

head_dim Compatibility

The paper only tested head_dim=128 (Llama, Mistral). We verified TurboQuant works with head_dim=256 (Gemma-2) — the Lloyd-Max codebook adapts to any dimension since it's computed from the Beta distribution parameterized by d.
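To illustrate why the codebook adapts to any d: a single coordinate of a uniformly random unit vector in R^d has density proportional to (1 - x^2)^((d-3)/2) on [-1, 1] (equivalently, (x+1)/2 is Beta((d-1)/2, (d-1)/2)), so the Lloyd-Max centroids can be recomputed numerically for any head_dim. A self-contained sketch (not the package's `codebook.py`, which uses scipy):

```python
import numpy as np

def lloyd_max_codebook(d, nbits=4, iters=300, grid=20001):
    """Lloyd-Max centroids for one coordinate of a random unit vector in R^d."""
    x = np.linspace(-1.0, 1.0, grid)
    # density of a single coordinate on the unit sphere, as masses on the grid
    p = np.clip(1.0 - x**2, 0.0, None) ** ((d - 3) / 2)
    p /= p.sum()
    k = 2**nbits
    # initialize centroids at distribution quantiles, then iterate the two
    # Lloyd-Max optimality conditions: cell boundaries at centroid midpoints,
    # each centroid at the conditional mean of its cell
    centroids = np.interp((np.arange(k) + 0.5) / k, np.cumsum(p), x)
    for _ in range(iters):
        bounds = (centroids[:-1] + centroids[1:]) / 2
        idx = np.searchsorted(bounds, x)
        mass = np.bincount(idx, weights=p, minlength=k)
        mean = np.bincount(idx, weights=p * x, minlength=k)
        centroids = np.where(mass > 0, mean / np.maximum(mass, 1e-300), centroids)
    return centroids

cb128 = lloyd_max_codebook(128)   # head_dim=128 (Llama, Qwen, Phi)
cb256 = lloyd_max_codebook(256)   # head_dim=256 (Gemma-2): same code, new d
```

The per-vector MSE of this quantizer is d times the per-coordinate distortion, which at 4 bits lands near the 0.009 figure reported above.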

Architecture Coverage

| Architecture | Paper Tested | We Tested | Works |
|---|---|---|---|
| Llama | Llama-3.1-8B | Llama-3.1-8B, 3.3-70B | Yes |
| Mistral | Ministral-7B | not tested | n/a |
| Qwen | not in paper | Qwen2.5-7B, 32B | Yes (with outlier handling) |
| Gemma | not in paper | Gemma-2-9B | Yes (head_dim=256) |
| Phi | not in paper | Phi-4-14B | Yes |

Files

```
turboquant/
├── __init__.py          # Public API
├── codebook.py          # Lloyd-Max solver for Beta distribution
├── quantizer.py         # Core TurboQuantizer: rotate → quantize → pack
├── packing.py           # uint4/uint2 bit packing
├── cache.py             # TurboQuantCache for HF Transformers
scripts/
├── verify.py            # Unit tests (MSE bounds, packing, fixed-point)
├── test_cache.py        # Cache API integration tests
├── benchmark_models.py  # Multi-model benchmark suite
├── run_inference.py     # Interactive inference demo
benchmark_results.json   # Raw benchmark data (all benchmarked models)
```

Verified Against Paper

| Metric | Paper | Ours |
|---|---|---|
| MSE at 4-bit (unit vectors) | ≤ 0.009 | 0.0093 |
| MSE at 2-bit (unit vectors) | ≤ 0.117 | 0.116 |
| Compression ratio (per-vector) | ~4x | 3.88x |
| System compression @8K+ | 4-7x | 7.2x |
| Prefill fidelity | "quality neutral" | exact (0.0 logit diff) |
| Double quantization | fixed point | verified (indices identical) |
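The 3.88x per-vector ratio follows directly from the byte layout at head_dim=128:

```python
head_dim = 128
bf16_bytes = head_dim * 2            # 256 bytes per BF16 key/value vector
packed_bytes = head_dim // 2 + 2     # 64 bytes of packed uint4 codes + 2-byte norm
print(round(bf16_bytes / packed_bytes, 2))  # 3.88
```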

Requirements

  • Python 3.10+
  • PyTorch 2.7+ (CUDA 12.8 compatible)
  • HuggingFace Transformers 5.0+
  • scipy (for codebook computation)
  • bitsandbytes (optional, for 4-bit model loading)

Citation

If you use this implementation, please cite the original paper:

```bibtex
@article{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  journal={arXiv preprint arXiv:2504.19874},
  year={2025}
}
```

License

This implementation is released under MIT License. The TurboQuant algorithm is described in the paper above.

