TurboQuant cuts KV cache memory 6x with 3-bit storage
Google Research said TurboQuant can shrink KV cache entries to roughly 3 bits per value, cutting memory by about 6x, and early implementations have already surfaced in PyTorch, llama.cpp, and Atomic Chat. The work targets a core inference bottleneck for long-context serving on local and server hardware.

TL;DR
- Google's TurboQuant writeup says the method targets high-dimensional vector compression for LLMs and vector search, with KV cache compression as a primary use case and a claimed path to "zero accuracy loss" through a PolarQuant-plus-QJL pipeline.
- The Hacker News thread quickly turned from paper summary to implementation tracking: commenters pointed to an independent PyTorch port and an in-progress llama.cpp integration the same day.
- Early local-app demos are already using it. In Atomic Chat, a MacBook Air demo showed Qwen 3.5-9B summarizing 50,000 words "in just seconds" on a 16 GB M4 Air, alongside claims of a roughly 3x larger context window and 3x faster processing than before.
- Community posts are now framing TurboQuant less as a research curiosity and more as a serving primitive: the llama.cpp CUDA post claims 3.5x KV cache compression with quality that "beats q8_0," which is the kind of tradeoff infra teams actually benchmark.
What changed in TurboQuant?
The Hacker News submission, "TurboQuant: Redefining AI efficiency with extreme compression," drew 558 upvotes and 160 comments.
Google Research describes TurboQuant as a family of quantization methods for "high-dimensional vectors" that goes after one of the most expensive parts of long-context inference: KV cache storage. In the research writeup, the core claim is that PolarQuant represents vectors in polar coordinates for better compression, while QJL acts as an error-correction step, together reducing the key-value bottleneck without the usual quality hit.
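To make the polar-coordinate idea concrete, here is a toy sketch that pairs up vector dimensions, converts each pair to a radius and an angle, and quantizes those coarsely before reconstructing. This is an illustration of quantizing in polar form only, not the paper's PolarQuant or QJL algorithm, and the bit widths are arbitrary.

```python
import numpy as np

# Toy illustration: quantize paired dimensions as (radius, angle) instead of
# raw Cartesian values. NOT the PolarQuant/QJL pipeline from the paper.
def polar_quantize(x, angle_bits=3, radius_bits=5):
    pairs = x.reshape(-1, 2)                      # group dims into (a, b) pairs
    r = np.linalg.norm(pairs, axis=1)             # radius per pair
    theta = np.arctan2(pairs[:, 1], pairs[:, 0])  # angle per pair, in [-pi, pi]

    # Uniform quantization of angle and radius to a small number of levels.
    levels_t = 2 ** angle_bits
    q_theta = np.round((theta + np.pi) / (2 * np.pi) * (levels_t - 1))
    theta_hat = q_theta / (levels_t - 1) * 2 * np.pi - np.pi

    levels_r = 2 ** radius_bits
    r_max = r.max() + 1e-8
    q_r = np.round(r / r_max * (levels_r - 1))
    r_hat = q_r / (levels_r - 1) * r_max

    # Reconstruct Cartesian coordinates from the quantized polar codes.
    rec = np.stack([r_hat * np.cos(theta_hat), r_hat * np.sin(theta_hat)], axis=1)
    return rec.reshape(x.shape)

x = np.random.randn(128)
err = np.linalg.norm(x - polar_quantize(x)) / np.linalg.norm(x)
print(f"relative reconstruction error: {err:.3f}")
```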
The practical angle is memory, not just model size. The Hacker News thread centers on KV-cache and vector compression as a deployment problem, and one commenter summarized the intuition as using matrix transforms to make models "fit in smaller boxes" while needing "way less RAM" (discussion highlights). That matters because KV cache memory grows with sequence length and batch size at serving time, so a method that cuts cache storage more aggressively than standard quantization raises the ceiling for context length and batch density.
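A back-of-the-envelope sizing makes the scaling visible. The model dimensions below are illustrative placeholders, not figures from the TurboQuant writeup or the Qwen demo; the point is only how per-value bit width moves the memory ceiling at a fixed context length.

```python
# KV cache memory = 2 (keys + values) * batch * seq_len * layers * KV heads
# * head_dim * bits_per_value. Dimensions here are hypothetical.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits_per_value, batch=1):
    values = 2 * batch * seq_len * n_layers * n_kv_heads * head_dim
    return values * bits_per_value / 8

# Hypothetical 9B-class config: 40 layers, 8 KV heads, 128-dim heads.
cfg = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for bits in (16, 8, 3):
    gb = kv_cache_bytes(seq_len=100_000, bits_per_value=bits, **cfg) / 1e9
    print(f"{bits:>2}-bit KV cache @ 100k tokens: {gb:.1f} GB")
```

With these placeholder numbers, 16-bit storage needs roughly 16 GB at 100k tokens while 3-bit storage needs about 3 GB, which is the kind of gap that decides whether a long context fits on a 16 GB laptop at all.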
Where is it already showing up?
The fastest signal is implementation activity. According to the Hacker News discussion highlights, developers had already linked an independent PyTorch implementation and a llama.cpp commit shortly after the research post landed, which suggests people are testing this at the runtime layer rather than waiting for a polished reference release.
Atomic Chat pushed the consumer-facing version of that story. Its MacBook Air demo claims TurboQuant is running locally on a 16 GB MacBook Air M4 with Qwen 3.5-9B and a 100,000-token context setting, while the mirrored app post points users to a Mac download and markets TurboQuant-backed local inference as faster and lighter-weight (Atomic Chat download).
The most engineering-specific datapoint comes from the llama.cpp side. A repost amplified by Hugging Face says "TurboQuant CUDA for llama.cpp" delivers "3.5x KV cache compression" and "beats q8_0 quality (-1.17% PPL)" (llama.cpp CUDA post). That is still a community claim rather than a vendor benchmark, but it is the clearest sign yet that TurboQuant is moving from paper language into measurable inference tradeoffs.
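For context on what a figure like "-1.17% PPL" means in practice, the sketch below shows the usual shape of such a comparison: score the same held-out text with a baseline model and with the modified setup (here, a quantized KV cache), then report the relative change in perplexity. The model name and evaluation text are placeholders, since the llama.cpp CUDA post does not spell out its exact protocol.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # labels=ids makes the model return the mean next-token loss.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

tok = AutoTokenizer.from_pretrained("gpt2")             # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")

baseline = perplexity(model, tok, "The quick brown fox jumps over the lazy dog.")
# A quantized-cache run would swap in the modified attention/cache here;
# the reported figure is then 100 * (quantized - baseline) / baseline.
print(f"baseline perplexity: {baseline:.2f}")
```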