Google's TurboQuant: 6x Less Memory for LLM Inference (2026)

April 6, 2026

Google's TurboQuant algorithm compresses the KV cache in large language models to just 3 bits per value — cutting memory use by 6x with zero accuracy loss. On NVIDIA H100 GPUs, 4-bit TurboQuant accelerates attention logit computation by up to 8x compared to 32-bit unquantized keys. Published as a conference paper at ICLR 2026, TurboQuant has already rattled memory chip stocks and sparked a wave of open-source implementations.1

What You'll Learn

  • Why the KV cache is the #1 memory bottleneck in LLM inference
  • How TurboQuant's rotation-then-quantize approach achieves near-optimal compression
  • Exact benchmark results on LongBench, Needle-in-a-Haystack, and H100 attention speedup
  • What TurboQuant means for GPU memory economics and inference costs
  • How to start using TurboQuant today with community implementations

TL;DR

TurboQuant is a training-free, model-agnostic compression algorithm from Google Research that shrinks the KV cache — the biggest memory consumer during LLM inference — from 16-bit precision down to 3 bits. It requires no calibration data, no fine-tuning, and works on any transformer architecture. On standard benchmarks, 3.5-bit TurboQuant matches full-precision performance exactly, while 4-bit TurboQuant delivers up to 8x speedup on H100 attention logit computation versus 32-bit keys (roughly 4x versus the FP16 baseline used in practice). The paper will be formally presented at ICLR 2026 in Rio de Janeiro on April 25, 2026.2

The KV Cache Problem: Why Memory Is the Real Bottleneck

Every time a large language model generates a token, it stores key-value pairs for every previous token across every attention layer. This is the KV cache, and it grows linearly with sequence length.

For short prompts, model weights dominate GPU memory. But as context windows push toward 128K tokens and beyond, the KV cache becomes the primary bottleneck. A single 128K-token prompt on Llama 3 70B consumes roughly 40 GB of high-bandwidth memory just for KV storage — the entire capacity of an NVIDIA A100 40GB, or half the 80GB variant.3
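The ~40 GB figure can be reproduced with a back-of-envelope calculation. This sketch assumes Llama 3 70B's grouped-query attention configuration (80 layers, 8 KV heads, head dimension 128, FP16 storage) as published in public model cards, not figures from the TurboQuant paper itself:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # Keys + values (the factor of 2), stored per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Llama 3 70B (grouped-query attention): 80 layers, 8 KV heads, head_dim 128
fp16 = kv_cache_bytes(80, 8, 128, 128 * 1024, bytes_per_value=2)
print(f"{fp16 / 2**30:.0f} GiB")  # 40 GiB at FP16 for a single 128K-token sequence
```

The linear growth in `seq_len` is exactly why long contexts, not model weights, dominate memory at 128K tokens.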

The problem compounds at scale. Inference providers serving thousands of concurrent users face a harsh tradeoff: longer contexts or more users, but not both. Traditional inference systems waste 60–80% of allocated KV cache memory through fragmentation.4 While innovations like PagedAttention (from the vLLM project) reduced fragmentation to under 4%, the fundamental issue remains: each token's key-value representation is stored at full 16-bit precision, consuming far more memory than the information content warrants.

This is the problem TurboQuant attacks.

How TurboQuant Works: Rotation, Then Quantization

TurboQuant combines two techniques — PolarQuant and Quantized Johnson-Lindenstrauss (QJL) — into a pipeline that compresses KV vectors with mathematically provable near-optimal distortion.5

Step 1: Random Orthogonal Rotation

The first stage applies a random orthogonal rotation to each KV vector. This sounds abstract, but the effect is concrete: it spreads the energy of the vector uniformly across all coordinates. Before rotation, some coordinates carry much more information than others (the "outlier channels" problem that plagues naive quantization). After rotation, each coordinate follows a predictable statistical distribution — approximately Gaussian — making all coordinates equally important and equally compressible.6

The rotation matrix is generated once via QR decomposition of a random Gaussian matrix and reused for all vectors. This adds negligible computational overhead.
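The rotation step can be sketched in a few lines of NumPy. This is an illustration of the general technique (QR decomposition of a Gaussian matrix, applied to a vector with an outlier channel), not Google's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # head dimension

# One-time setup: QR decomposition of a random Gaussian matrix
# yields a random orthogonal rotation Q, reused for every vector.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

v = rng.standard_normal(d)
v[0] = 50.0  # an "outlier channel" dominating the vector

rotated = Q @ v

# Orthogonal rotation preserves norms (and all inner products)...
assert np.isclose(np.linalg.norm(v), np.linalg.norm(rotated))

# ...while spreading energy: the peak coordinate shrinks relative to the norm
before = np.abs(v).max() / np.linalg.norm(v)
after = np.abs(rotated).max() / np.linalg.norm(rotated)
print(f"peak/norm before: {before:.2f}, after: {after:.2f}")
```

Because `Q` is orthogonal, attention scores computed on rotated vectors are unchanged; only the coordinate-wise statistics improve.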

Step 2: Lloyd-Max Optimal Quantization

Because the post-rotation distribution is known analytically, TurboQuant computes mathematically optimal quantization buckets using the Lloyd-Max algorithm. Unlike learned quantization schemes that require calibration data and model-specific tuning, Lloyd-Max buckets are derived from probability theory alone. You compute them once, and they work on any model.7

At 3 bits, this means each coordinate maps to one of 8 optimally placed buckets. At 4 bits, 16 buckets. The key insight is that the random rotation guarantees the distribution matches the assumption — there are no outliers left to break the quantizer.
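The bucket placement can be approximated numerically with Lloyd's fixed-point iteration on Gaussian samples. This is a sketch: the paper's buckets are derived analytically for the known post-rotation distribution, but the iteration below converges to the same levels:

```python
import numpy as np

def lloyd_max_levels(bits, n_samples=200_000, iters=60, seed=0):
    """Estimate Lloyd-Max quantization levels for a standard normal source."""
    rng = np.random.default_rng(seed)
    x = np.sort(rng.standard_normal(n_samples))
    k = 2 ** bits
    levels = np.quantile(x, (np.arange(k) + 0.5) / k)  # start at even quantiles
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2     # boundaries: midpoints of levels
        cells = np.searchsorted(edges, x)          # nearest-level assignment
        levels = np.array([x[cells == j].mean() for j in range(k)])  # cell means
    return levels

levels = lloyd_max_levels(3)  # 8 buckets for 3-bit quantization
edges = (levels[:-1] + levels[1:]) / 2
x = np.random.default_rng(1).standard_normal(200_000)
mse = np.mean((x - levels[np.searchsorted(edges, x)]) ** 2)
print(np.round(levels, 2))
print(f"MSE: {mse:.3f}")  # ~0.034 for a unit-variance Gaussian at 8 levels
```

Notably, the classical 8-level Lloyd-Max distortion for a unit Gaussian (~0.034) is consistent with the TQ3 mean squared error reported below.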

Step 3: QJL Error Correction

TurboQuant adds one additional bit per coordinate using the Quantized Johnson-Lindenstrauss transform. This creates a mathematical error-checking mechanism that preserves distance relationships between vectors — critical for attention score computation. The extra bit ensures that even at extreme compression ratios, the attention pattern remains faithful to the original.8
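The flavor of the 1-bit mechanism can be conveyed with a simplified sign-random-projection sketch. Note this is a two-sided simplification for illustration: the actual QJL estimator quantizes only the key side and keeps the query side at full precision:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 1024  # key dimension; number of 1-bit sketch coordinates

S = rng.standard_normal((m, d))  # random JL projection, shared by all vectors

def sign_bits(v):
    # Quantize each projected coordinate down to a single sign bit
    return np.sign(S @ v)

k = rng.standard_normal(d)
q = k + 0.3 * rng.standard_normal(d)  # a nearby query direction

# The fraction of agreeing bits encodes the angle between the vectors,
# so inner products (attention logits) can be estimated from bits alone.
agree = np.mean(sign_bits(k) == sign_bits(q))
est = np.pi * (1.0 - agree)
true = np.arccos(k @ q / (np.linalg.norm(k) * np.linalg.norm(q)))
print(f"estimated angle {est:.2f} rad vs true {true:.2f} rad")
```

The point is that even single-bit sketches preserve the pairwise geometry attention depends on, which is why one extra bit per coordinate is enough to stabilize extreme compression.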

Why It's Called PolarQuant

The quantization stage (Steps 1 and 2) is called PolarQuant because it internally converts coordinates to polar form — radius and angle rather than Cartesian representation. This eliminates the costly per-block normalization constants that traditional quantizers require, further reducing overhead. TurboQuant is the full pipeline: PolarQuant for quantization plus QJL for error correction.9
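The polar reparameterization itself is a lossless change of coordinates. This sketch pairs up coordinates and round-trips them through (radius, angle) form; the specifics of how PolarQuant then quantizes radii and angles are in the paper, so treat this as the coordinate change only:

```python
import numpy as np

def to_polar_pairs(v):
    # Group coordinates into (x, y) pairs and convert each to (radius, angle)
    x, y = v[0::2], v[1::2]
    return np.hypot(x, y), np.arctan2(y, x)

def from_polar_pairs(r, theta):
    # Invert the pairing: recover interleaved Cartesian coordinates
    v = np.empty(2 * r.size)
    v[0::2], v[1::2] = r * np.cos(theta), r * np.sin(theta)
    return v

v = np.random.default_rng(0).standard_normal(128)
r, theta = to_polar_pairs(v)
assert np.allclose(from_polar_pairs(r, theta), v)  # lossless round-trip
```

Angles are naturally bounded in [-π, π], which is one way a polar representation can sidestep per-block scale constants.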

Benchmark Results: Zero Accuracy Loss at 6x Compression

TurboQuant was evaluated on Llama-3.1-8B-Instruct, Gemma, and Mistral models across five standard long-context benchmarks: LongBench, Needle-in-a-Haystack, ZeroSCROLLS, RULER, and L-Eval.10

LongBench (Llama-3.1-8B-Instruct)

LongBench is a comprehensive suite covering question answering, code generation, and summarization across long documents. At 3.5 bits per value, TurboQuant matched the full-precision 16-bit baseline exactly, with an average score of 50.06. Even at an aggressive 2.5 bits, the score only dropped to 49.44.11

Needle-in-a-Haystack (Llama-3.1-8B-Instruct, up to 104K context)

This benchmark tests whether a model can retrieve a specific fact buried in a long document. TurboQuant scored 0.997 at 4x compression — functionally identical to full precision — across context lengths up to 104K tokens.12

Attention Throughput on H100

On NVIDIA H100 GPUs, 4-bit TurboQuant delivered up to 8x speedup in attention logit computation compared to 32-bit unquantized keys (roughly 4x versus the FP16 standard used in practice). This is not an end-to-end inference speedup — attention is a significant but not sole bottleneck, so actual wall-clock improvement for full inference will be lower. The speedup comes from reduced memory bandwidth requirements: smaller KV values mean more data fits in GPU cache, and fewer memory transactions are needed per attention operation.13

Compression Metrics

At 3 bits (TQ3), TurboQuant achieves 4.9x compression versus FP16, using 52 bytes per 128-value vector compared to 256 bytes at full precision. At 4 bits (TQ4), the compression ratio is 3.8x, with 68 bytes per 128-value vector. The mean squared error at TQ3 is 0.034, and at TQ4 just 0.009.14
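The reported ratios follow directly from the per-vector byte counts. A quick check (the breakdown of the 52- and 68-byte footprints into payload plus per-vector overhead is my inference, not stated in the source):

```python
# Bytes per 128-value KV vector, per the reported TurboQuant figures
fp16 = 128 * 2  # 256 bytes at full 16-bit precision
tq3 = 52        # consistent with 48 bytes of 3-bit payload + ~4 bytes overhead
tq4 = 68        # consistent with 64 bytes of 4-bit payload + ~4 bytes overhead

print(f"TQ3: {fp16 / tq3:.1f}x, TQ4: {fp16 / tq4:.1f}x")  # 4.9x and 3.8x
```

This also explains why the headline "6x" figure is best read against higher-precision baselines rather than the FP16 byte counts above.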

How TurboQuant Compares to KIVI

KIVI, published at ICML 2024, is the standard baseline for KV cache quantization. It uses asymmetric 2-bit quantization with per-channel handling for keys and per-token handling for values, achieving roughly 2.6x memory reduction. TurboQuant matches or outperforms KIVI on LongBench at equivalent bit budgets while being simpler to implement — no per-channel or per-token calibration required.15

Market Impact: Why Memory Chip Stocks Dropped

When Google published the TurboQuant research blog on March 25, 2026, the market reaction was swift and dramatic. On the Korea Exchange the following day, SK Hynix shares fell 6.23% and Samsung Electronics dropped 4.8%, dragging the KOSPI index down as much as 3%. U.S. memory stocks also sold off sharply on March 26, with SanDisk falling as much as 8%, Micron dropping around 5%, and Western Digital declining roughly 5%, according to reporting from 24/7 Wall St.16

The logic was straightforward: if AI inference can run on a fraction of the memory, demand for high-bandwidth memory (HBM) and server DRAM weakens. But this framing misses a critical distinction. TurboQuant compresses inference-stage working memory only. It offers no reduction in the massive memory requirements of model training, which is where the largest memory procurement volumes originate.17

Analysts largely called the selloff an overreaction. Lynx Equity Strategies maintained a $700 price target on Micron, recommending buying the dip.18 The broader consensus: TurboQuant is a genuine breakthrough for inference efficiency, but it does not fundamentally change the demand trajectory for memory hardware across the full AI stack.

Getting Started: Open-Source Implementations

Google has not yet released official TurboQuant code, but the community has moved fast. As of early April 2026, several implementations are available.19

For llama.cpp users, the turboquant_plus project offers an experimental integration with Metal support on Apple Silicon. It has been validated end-to-end from 1.5B to 104B parameter models on M5 Max, with the 104B model at 128K context achieving perplexity of 4.024 at just 74 GB peak memory. Prefill throughput roughly matches q8_0 while compressing the KV cache by about 4.6x.20

For vLLM, a fork based on version 0.18.1rc1 with TurboQuant support is available on GitHub, and an official feature request is open in the vLLM project.21

Independent implementations also exist in PyTorch (with Triton kernels) and MLX. Google's official implementation is expected around Q2 2026.22

What This Means for LLM Inference Economics

TurboQuant is not the first KV cache compression technique, but its combination of properties makes it uniquely practical: it requires no training data, no model-specific calibration, no fine-tuning, and drops into any transformer architecture. The compression is nearly free computationally — the rotation and quantization add negligible overhead compared to the memory bandwidth savings.

For inference providers, the math is compelling. A 6x reduction in KV cache memory means either serving 6x more concurrent users on the same hardware, or extending context windows 6x longer at the same concurrency. In a market where GPU hours are the dominant cost, this translates directly to lower per-token inference costs.

For developers running models locally, TurboQuant means that models which previously required high-end GPUs for long-context work can now run on consumer hardware. The llama.cpp implementation already demonstrates 104B parameter models running at 128K context on a MacBook.23

Combined with other recent advances — PagedAttention for memory management, speculative decoding for latency, and NVIDIA's NVFP4 for low-precision model weights and activations — TurboQuant adds another layer to the inference optimization stack: extreme KV cache compression with zero quality degradation.

If you're interested in how the AI hardware landscape is evolving alongside these software optimizations, see our analysis of Huawei's Ascend 950PR and its challenge to NVIDIA. For a hands-on guide to running large models locally, check out our complete guide to building local AI with RAG agents and Ollama.


Footnotes

  1. TurboQuant published as ICLR 2026 conference paper. Source: ICLR 2026 Poster

  2. ICLR 2026 conference, Rio de Janeiro, poster presentation April 25, 2026. Source: ICLR 2026 Schedule

  3. Llama 3 70B KV cache memory at 128K tokens. Source: BentoML LLM Inference Handbook

  4. KV cache memory fragmentation 60–80%. Source: BentoML LLM Inference Handbook

  5. TurboQuant combines PolarQuant and QJL. Source: Google Research Blog

  6. Random orthogonal rotation spreads energy uniformly. Source: Google Research Blog

  7. Lloyd-Max optimal quantization derived from probability theory. Source: Google Research Blog

  8. QJL adds 1 bit per coordinate for error correction. Source: Google Research Blog

  9. PolarQuant converts to polar coordinates. Source: Google Research Blog

  10. Evaluated on Llama-3.1-8B, Gemma, Mistral across 5 benchmarks. Source: Tom's Hardware

  11. LongBench 3.5-bit score 50.06 matching FP16 baseline; 2.5-bit score 49.44. Source: Tom's Hardware

  12. Needle-in-a-Haystack 0.997 at 4x compression up to 104K context. Source: Tom's Hardware

  13. 4-bit TurboQuant 8x attention throughput on H100. Source: Tom's Hardware

  14. TQ3: 4.9x compression, 52 bytes/128 values, MSE 0.034; TQ4: 3.8x, 68 bytes, MSE 0.009. Source: llama.cpp Discussion #20969

  15. TurboQuant matches or outperforms KIVI on LongBench. Source: Tom's Hardware

  16. SK Hynix -6.23%, Samsung -4.8% on Korea Exchange; SanDisk ~-8%, Micron ~-5%, WDC ~-5% in U.S. Source: 24/7 Wall St., Seoul Economic Daily

  17. TurboQuant targets inference only, not training memory. Source: TradingKey Analysis

  18. Lynx Equity $700 Micron target, recommends buying the dip. Source: Seoul Economic Daily

  19. Community implementations available despite no official Google code. Source: GitHub vLLM Issue #38171

  20. llama.cpp turboquant_plus: 104B at 128K, PPL 4.024, 74 GB peak, 4.6x compression. Source: GitHub turboquant_plus

  21. vLLM 0.18.1rc1 fork with TurboQuant. Source: GitHub vllm-turboquant

  22. Official Google code expected Q2 2026. Source: Kaitchup Substack

  23. 104B model at 128K context on MacBook with turboquant_plus. Source: GitHub turboquant_plus

  24. TurboQuant is training-free. Source: Google Research Blog

  25. TurboQuant compresses inference KV cache only. Source: TradingKey Analysis

  26. Benchmarks on H100; community support for NVIDIA, Apple Silicon, CPU. Source: Tom's Hardware

  27. TurboQuant and PagedAttention are complementary. Source: BentoML LLM Inference Handbook

  28. Official code expected Q2 2026. Source: Kaitchup Substack

Frequently Asked Questions

Does TurboQuant require retraining, fine-tuning, or calibration data?

No. TurboQuant is entirely training-free and requires no calibration data. The quantization parameters are derived mathematically from probability theory, not learned from data. It works on any pre-trained transformer model.24
