Google Research said TurboQuant can shrink KV cache storage to 3 bits per value, roughly a 6x memory reduction, and early implementations have already surfaced in PyTorch, llama.cpp, and Atomic Chat. The work targets a core inference bottleneck for long-context serving on local and server hardware.

Posted by ray__
Google Research introduces TurboQuant, a set of advanced quantization algorithms comprising TurboQuant itself (ICLR 2026), Quantized Johnson-Lindenstrauss (QJL), and PolarQuant (AISTATS 2026). The methods target the memory overhead of storing high-dimensional vectors, enabling heavy compression for large language models and vector search engines. For KV cache compression and vector search, TurboQuant claims large reductions in storage with zero accuracy loss via two steps: PolarQuant rotates vectors into polar coordinates for high-quality compression, and QJL acts as an error-correction stage, shrinking the key-value bottleneck without sacrificing performance.
Google Research describes TurboQuant as a family of quantization methods for "high-dimensional vectors" that goes after one of the most expensive parts of long-context inference: KV cache storage. In the research writeup, the core claim is that PolarQuant rotates vectors into polar coordinates for better compression, while QJL acts as an error-correction step, together reducing the key-value bottleneck without the usual quality hit.
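The rotate-then-quantize recipe described above can be illustrated with a minimal sketch: apply a random rotation, keep only the sign of each rotated coordinate (1 bit per dimension) plus the original norm, and invert the rotation at read time. This is an illustrative reconstruction of the general idea behind QJL-style quantization, not Google's implementation; the dimension, the rotation construction, and the 1-bit choice here are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d: int) -> np.ndarray:
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def encode(v: np.ndarray, R: np.ndarray):
    """Rotate, then keep only the signs (1 bit/dim) plus the vector's norm."""
    rotated = R @ v
    return np.sign(rotated).astype(np.int8), np.linalg.norm(v)

def decode(signs: np.ndarray, norm: float, R: np.ndarray) -> np.ndarray:
    # Invert the rotation and rescale so the reconstruction has the stored norm.
    approx = R.T @ signs.astype(np.float64)
    return approx * (norm / np.linalg.norm(approx))

d = 64
v = rng.normal(size=d)
R = random_rotation(d)
signs, norm = encode(v, R)
v_hat = decode(signs, norm, R)
cos_sim = v @ v_hat / (np.linalg.norm(v) * np.linalg.norm(v_hat))
print(f"cosine similarity after 1-bit round trip: {cos_sim:.3f}")
```

The rotation matters: for a random direction, sign quantization after rotation preserves a cosine similarity of about 0.8 in expectation, whereas naively keeping signs of a structured vector can be much worse. Real schemes add error correction (the role the article attributes to QJL) on top of this baseline.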
The practical angle is memory, not just model size. The core Hacker News thread centers on KV-cache and vector compression as a deployment problem, and one commenter summarized the intuition as using matrix transforms to make models "fit in smaller boxes" while needing "way less RAM". That matters because KV cache storage scales with sequence length at serving time, so a method that cuts cache storage more aggressively than standard quantization raises the ceiling for context length and batch density.
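To see why cache storage dominates at long context, a back-of-envelope calculation helps. The model shape below (36 layers, 8 KV heads, head dimension 128) is a hypothetical 9B-class configuration chosen for illustration, not a published spec; the point is only how per-value bit width scales the total.

```python
def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_value: int) -> float:
    # 2 tensors (K and V) per layer, each of shape [seq_len, n_kv_heads * head_dim].
    values = 2 * n_layers * seq_len * n_kv_heads * head_dim
    return values * bits_per_value / 8

# Hypothetical 9B-class shape at a 100,000-token context.
fp16 = kv_cache_bytes(100_000, 36, 8, 128, 16)
q3 = kv_cache_bytes(100_000, 36, 8, 128, 3)
print(f"fp16: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.1f} GiB "
      f"({fp16 / q3:.1f}x smaller)")
```

Under these assumptions the fp16 cache alone is in the double-digit-GiB range, which is why a 16 GB laptop cannot hold it, while 3 bits per value brings the raw ratio to 16/3, about 5.3x (the article's "roughly 6x" presumably accounts for additional overheads differently).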
The fastest signal is implementation activity. According to the Hacker News discussion highlights, developers had already linked an independent PyTorch implementation and a llama.cpp commit shortly after the research post landed, which suggests people are testing this at the runtime layer rather than waiting for a polished reference release.
Atomic Chat pushed the consumer-facing version of that story. Its MacBook Air demo claims TurboQuant is running locally on a 16 GB MacBook Air M4 with Qwen 3.5-9B and a 100,000-token context setting, while the mirrored app post points users to a Mac download and markets TurboQuant-backed local inference as faster and lighter-weight.
The most engineering-specific datapoint comes from the llama.cpp side. A repost amplified by Hugging Face says a "TurboQuant CUDA for llama.cpp" build delivers "3.5x KV cache compression" and "beats q8_0 quality (-1.17% PPL)". That is still a community claim rather than a vendor benchmark, but it is the clearest sign yet that TurboQuant is moving from paper language into measurable inference tradeoffs.
Posted by ray__
TurboQuant is relevant if you care about serving or deploying LLMs efficiently: the thread centers on compression of KV caches and high-dimensional vectors, plus practical follow-on implementations in PyTorch and llama.cpp. The main engineering takeaway is that rotation-based quantization may reduce memory pressure without the usual accuracy loss, which could matter for inference cost and context-length scaling.
Posted by ray__
Thread discussion highlights:
- redanddead on rotation-based compression intuition: "AI and graphics are matrices... by mutating and transforming the matrix with a function... you have matrices that make smarter models fit in smaller boxes, needing way less RAM..."
- pstoll on open-source implementation: "And a group has published an independent working implementation today, nice to see: https://github.com/tonbistudio/turboquant-pytorch"
- akhenakh on llama.cpp support: "Someone implementing it on llamacpp already https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c91f67138b8018ce35a6be477"
Google TurboQuant running locally in Atomic Chat
MacBook Air M4, 16 GB
Model: QWEN3.5-9B
Context window: 100,000
Summarising 50,000 words in just seconds. You can do a 3x larger context window, processing 3x faster than before! They are first that have integrated Google...
Google TurboQuant running locally in Atomic Chat
MacBook Air M4, 16 GB
Model: QWEN3.5-9B
Context window: 50,000
Summarising 20,000 words in just seconds. You can do a 3x larger context window, processing 3x faster than before!