New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.

Posted by ray__
Google Research introduces TurboQuant, a compression algorithm that eliminates memory overhead in vector quantization for high-dimensional vectors in AI models and vector search. It combines PolarQuant, which delivers high-quality compression via random rotation and quantization, with Quantized Johnson-Lindenstrauss (QJL) for error correction. TurboQuant (to be presented at ICLR 2026), together with PolarQuant and QJL (AISTATS 2026), reduces key-value cache bottlenecks with zero accuracy loss, enabling applications in LLMs and search engines. Published March 24, 2026 by Amir Zandieh and Vahab Mirrokni.
The main new detail is not a new paper result but a clearer explanation of how TurboQuant reaches its headline bitrates. Google’s research post frames the method as high-dimensional compression for LLMs and vector search, using PolarQuant for random-rotation quantization and QJL for error correction to reduce KV-cache bottlenecks.
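The random-rotation idea can be illustrated with a generic sketch: rotating a vector by a random orthogonal matrix spreads outlier energy across coordinates, shrinking the dynamic range a uniform quantizer has to cover. The snippet below is an illustrative numpy stand-in, not the paper's PolarQuant construction; the QR-based rotation, 2-bit width, and outlier pattern are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128

def uniform_quantize(v, bits):
    """Round v onto 2**bits evenly spaced levels spanning its range."""
    levels = 2 ** bits
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    return np.round((v - lo) / step) * step + lo

plain_err = rotated_err = 0.0
for _ in range(20):
    # Random orthogonal rotation from the QR decomposition of a Gaussian
    # matrix; a generic stand-in, not necessarily the paper's construction.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    x = rng.standard_normal(d)
    x[:8] *= 20.0  # heavy outlier coordinates blow up the dynamic range
    plain_err += np.linalg.norm(x - uniform_quantize(x, 2))
    # Rotate, quantize, rotate back (Q is orthogonal, so Q.T inverts it).
    rotated_err += np.linalg.norm(x - Q.T @ uniform_quantize(Q @ x, 2))

print(f"mean 2-bit error without rotation: {plain_err / 20:.1f}")
print(f"mean 2-bit error with rotation:    {rotated_err / 20:.1f}")
```

On data with a few large channels, the rotated version should show markedly lower reconstruction error, which is the intuition behind rotating before quantizing.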
What the newer thread adds is the implementation logic behind “2.5-bit.” The discussion traces it to a mixed-precision split: a small group of 32 outlier channels is quantized at higher precision, while the remaining 96 channels use 2 bits, averaging out to an effective 2.5 bits per channel. That makes the claim more concrete for engineers: TurboQuant uses outlier-aware bit allocation, not an exotic fractional-bit datatype.
Posted by ray__
Thread discussion highlights:
- bdcs on mixed-precision bit allocation: in the 2.5-bit setup, 32 outlier channels are quantized at 4 bits while the remaining 96 channels use 2 bits, giving an effective bit precision of (32×4 + 96×2)/128 = 2.5.
- pstoll on an independent implementation: a group has published an independent working implementation today: https://github.com/tonbistudio/turboquant-pytorch
- akhenakh on llama.cpp integration: someone is already implementing it in llama.cpp: https://github.com/mudler/llama.cpp/commit/dee102db1bfd723c9...
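The effective bitrate of such a split is just a channel-weighted average of the per-group bit widths. A minimal sketch of that arithmetic, using the thread's 32/96 channel split with an assumed 4-bit outlier width (the combination that averages to exactly 2.5 bits over 128 channels):

```python
def effective_bits(groups):
    """Channel-weighted average bit width over (num_channels, bits) groups."""
    total_channels = sum(n for n, _ in groups)
    return sum(n * b for n, b in groups) / total_channels

# 32 outlier channels at 4 bits plus 96 regular channels at 2 bits
# average to (32*4 + 96*2) / 128 bits per channel.
print(effective_bits([(32, 4), (96, 2)]))  # → 2.5
```

The same weighted average explains any non-integer headline bitrate without needing a fractional-bit datatype.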
The strongest signal here is early porting activity. The Hacker News thread highlights an “independent working implementation” in PyTorch and notes that someone is already implementing it in llama.cpp, both via linked community work. That is still community-driven, but it means the technique is already being mapped onto common inference tooling.
A supporting X post from Wes Roth shows “Google Turbo Quant” running on “a standard MacBook Air M4 with just 16GB of RAM,” alongside a short terminal demo. It does not establish throughput, latency, or accuracy, but paired with the PyTorch and llama.cpp efforts it reinforces the near-term story: TurboQuant is being translated into practical memory-saving experiments rather than staying confined to a research blog.
Posted by ray__
TurboQuant matters as an inference-memory technique: the key takeaway is aggressive KV-cache/vector compression with quality preserved via rotation plus outlier-aware mixed precision. The practical signal is that people are already porting it to PyTorch and llama.cpp, so this is moving toward real deployment workflows, not just theory.
Posted by ray__
The only materially new discussion today is a paper-detail clarification about non-integer bitrates. One commenter points out that the 2.5-bit and 3.5-bit results come from splitting channels into outlier and non-outlier groups and quantizing them separately, rather than using a literal fractional-bit representation. That adds a useful implementation detail to the earlier high-level explanations: TurboQuant is not just “quantize harder,” but a mixed-precision scheme that treats outlier channels differently to preserve quality while pushing compression further.
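A generic sketch of that outlier/non-outlier split: quantize two channel groups separately at different bit widths, then scatter the results back into place. The magnitude-based outlier selection, group sizes, and bit widths below are illustrative assumptions, not TurboQuant's actual procedure:

```python
import numpy as np

def uniform_quantize(v, bits):
    """Round v onto 2**bits evenly spaced levels spanning its range."""
    levels = 2 ** bits
    lo, hi = v.min(), v.max()
    step = (hi - lo) / (levels - 1)
    return np.round((v - lo) / step) * step + lo

def outlier_aware_quantize(x, n_outlier, hi_bits, lo_bits):
    """Quantize the largest-magnitude channels at hi_bits, the rest at lo_bits.

    Selecting outliers per-vector by magnitude is an illustrative assumption;
    real schemes typically fix the outlier channel set from calibration data.
    """
    order = np.argsort(-np.abs(x))
    out = np.empty_like(x)
    outliers, regular = order[:n_outlier], order[n_outlier:]
    out[outliers] = uniform_quantize(x[outliers], hi_bits)
    out[regular] = uniform_quantize(x[regular], lo_bits)
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal(128)
x[rng.choice(128, 8, replace=False)] *= 10.0  # a few outlier channels

mixed = np.linalg.norm(x - outlier_aware_quantize(x, 32, hi_bits=4, lo_bits=2))
flat = np.linalg.norm(x - uniform_quantize(x, 2))
print(f"mixed-precision error: {mixed:.2f}")
print(f"flat 2-bit error:      {flat:.2f}")
```

Because the outlier group gets extra bits and the regular group quantizes over a much smaller range, the mixed scheme should beat a flat 2-bit quantizer on data with heavy channels, which is the quality-preserving effect the thread describes.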