TurboQuant update: 2.5-bit mixed precision explained, with PyTorch and llama.cpp ports
New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.

TL;DR
- Google Research’s TurboQuant post introduced TurboQuant as a KV-cache and vector compression method that combines PolarQuant with Quantized Johnson-Lindenstrauss error correction, with the stated goal of removing vector-quantization memory overhead without accuracy loss.
- The new implementation detail getting attention is TurboQuant’s non-integer bitrate setup: the fresh discussion says the reported 2.5-bit result comes from mixed precision, where outlier channels are quantized separately instead of using a literal fractional-bit format.
- The Hacker News thread also points to working ports in both PyTorch and llama.cpp, shifting the story from a paper claim toward code paths engineers can test in real inference stacks.
- A supporting demo posted on X shows “Google Turbo Quant” running on a 16GB MacBook Air M4, which is useful as a portability signal but is not a published benchmark.
What changed in the TurboQuant discussion?
TurboQuant: Redefining AI efficiency with extreme compression
Google Research introduces TurboQuant, a compression algorithm that eliminates memory overhead in vector quantization for high-dimensional vectors in AI models and vector search. It uses PolarQuant for high-quality compression via random rotation and quantization, and Quantized Johnson-Lindenstrauss (QJL) for error correction. TurboQuant, to be presented at ICLR 2026, along with PolarQuant and QJL (AISTATS 2026), reduces key-value cache bottlenecks with zero accuracy loss, enabling applications in LLMs and search engines. Published March 24, 2026 by Amir Zandieh and Vahab Mirrokni.
The main new detail is not a new paper result but a clearer explanation of how TurboQuant reaches its headline bitrates. Google’s research post frames the method as high-dimensional compression for LLMs and vector search, using PolarQuant for random-rotation quantization and QJL for error correction to reduce KV-cache bottlenecks.
What the newer thread adds is the implementation logic behind “2.5-bit.” The discussion says this comes from a mixed-precision split: as one commenter put it, “32 outlier channels are quantized at 3 bits” while “the remaining 96 channels use 2 bits,” which the quoted text gives as an effective 2.5-bit average. Evaluated as written, that split works out to (32 × 3 + 96 × 2)/128 = 2.25 bits, while an even 64/64 split would give 2.5, so the quoted figures are not fully consistent with each other. Either way, the claim becomes more concrete for engineers: TurboQuant is using outlier-aware allocation, not some exotic fractional-bit datatype.
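The averaging behind a "fractional" bitrate is easy to check by hand. The sketch below is illustrative only: `effective_bits` is a hypothetical helper, not part of any TurboQuant code, and it only computes the weighted average of per-channel bit widths. It does not model how outlier channels are chosen or any extra metadata a real format would store.

```python
from fractions import Fraction

def effective_bits(groups):
    """Average bits per channel for a mixed-precision split.

    groups: list of (num_channels, bits_per_channel) pairs.
    Returns an exact Fraction so small differences are not hidden by rounding.
    """
    total_channels = sum(n for n, _ in groups)
    total_bits = sum(n * b for n, b in groups)
    return Fraction(total_bits, total_channels)

# Split as quoted in the thread: 32 outlier channels at 3 bits, 96 at 2 bits.
quoted_split = effective_bits([(32, 3), (96, 2)])

# An even split (hypothetical, for comparison): 64 channels at 3 bits, 64 at 2 bits.
even_split = effective_bits([(64, 3), (64, 2)])

print(float(quoted_split))  # 2.25
print(float(even_split))    # 2.5
```

The quoted 32/96 split averages 2.25 bits per channel, while a 64/64 split averages 2.5, which is worth keeping in mind when comparing ports against the headline number.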
Is this moving beyond a research claim?
Discussion around TurboQuant: Redefining AI efficiency with extreme compression
Thread discussion highlights:
- bdcs on mixed-precision bit allocation: "our 2.5-bit setup, 32 outlier channels are quantized at 3 bits, while the remaining 96 channels use 2 bits, leading to an effective bit precision of (32 × 3 + 96 × 2)/128 = 2.5"
- pstoll on independent implementation: "And a group has published an independent working implementation today, nice to see: https://github.com/tonbistudio/turboquant-pytorch"
- akhenakh on llama.cpp integration: "Someone implementing it on llamacpp already https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c9..."
The strongest signal here is early porting activity. The Hacker News thread highlights an “independent working implementation” in PyTorch, and another commenter reports that someone is already implementing it in llama.cpp. Both efforts are community-driven, but they mean the technique is already being mapped onto common inference tooling.
A supporting X post from Wes Roth shows “Google Turbo Quant” running on “a standard MacBook Air M4 with just 16GB of RAM,” alongside a short terminal demo. It does not establish throughput, latency, or accuracy, but paired with the PyTorch and llama.cpp efforts it reinforces the near-term story: TurboQuant is being translated into practical memory-saving experiments rather than staying confined to a research blog.
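To get a rough sense of why a lower KV-cache bitrate matters on a 16GB machine, here is a back-of-the-envelope estimate. Everything in it is an assumption for illustration: `kv_cache_bytes` is a hypothetical helper, the model shape (32 layers, 8 KV heads, head dimension 128) and the 32,768-token context are made up rather than taken from the post or the demo, and the estimate ignores any scales, metadata, or error-correction bits a real implementation would add.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV-cache size in bytes for one sequence.

    Counts one key and one value vector per layer, per KV head, per token,
    and ignores quantization metadata such as scales or correction bits.
    """
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8

GIB = 2 ** 30

# Hypothetical model shape and context length, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=32768)

fp16_size = kv_cache_bytes(bits_per_value=16, **shape)
low_bit_size = kv_cache_bytes(bits_per_value=2.5, **shape)

print(fp16_size / GIB)           # 4.0
print(low_bit_size / GIB)        # 0.625
print(fp16_size / low_bit_size)  # 6.4
```

Under these assumptions the cache shrinks from 4 GiB to 0.625 GiB, a 6.4x reduction. Real savings would be somewhat smaller once per-block metadata is counted, and the figure says nothing about speed or accuracy.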