Update · March 28, 2026

TurboQuant updates 2.5-bit mixed precision with PyTorch and llama.cpp ports

New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and on community ports to PyTorch and llama.cpp. The technique is moving from a research claim toward deployable KV-cache compression, with concrete porting details.

TL;DR

Google Research’s post introduced TurboQuant as a KV-cache and vector compression method that combines PolarQuant with Quantized Johnson-Lindenstrauss (QJL) error correction, with the stated goal of removing vector-quantization memory overhead without accuracy loss.
The implementation detail getting attention is TurboQuant’s non-integer bitrate: according to the Hacker News discussion, the reported 2.5-bit result comes from mixed precision, with outlier channels quantized at a higher bit width than the rest, rather than from a literal fractional-bit format.
The Hacker News thread also points to a working PyTorch port and an in-progress llama.cpp implementation, shifting the story from a paper claim toward code paths engineers can test in real inference stacks.
A supporting demo posted on X shows “Google Turbo Quant” running on a 16GB MacBook Air M4, which is useful as a portability signal but is not a published benchmark.

What changed in the TurboQuant discussion?

Hacker News

TurboQuant: Redefining AI efficiency with extreme compression

566 upvotes · 162 comments

The main new detail is not a new paper result but a clearer explanation of how TurboQuant reaches its headline bitrates. Google’s research post frames the method as high-dimensional compression for LLMs and vector search, using PolarQuant for random-rotation quantization and QJL for error correction to reduce KV-cache bottlenecks.
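To make the random-rotation idea concrete, here is a minimal sketch of why rotating a vector before quantizing it can help. It uses a randomized Hadamard transform (random sign flips followed by a Walsh-Hadamard transform), which is one common way to build a cheap random rotation. That choice is an assumption for illustration: the post as summarized here does not say which rotation PolarQuant uses, and the helper names (`fwht`, `random_signs`, `rotate`) are hypothetical, not taken from any TurboQuant code.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalization makes the transform norm-preserving
    return [v * scale for v in out]


def random_signs(n, seed=0):
    """Fixed pseudo-random +/-1 pattern; the same pattern must be reused to undo the rotation."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rotate(vec, signs):
    """Illustrative random rotation: sign flips, then the Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, vec)])


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


# Hypothetical 128-channel vector with one artificial outlier channel.
rng = random.Random(1)
vec = [rng.gauss(0.0, 1.0) for _ in range(128)]
vec[7] = 40.0
rotated = rotate(vec, random_signs(len(vec)))

print("max |x| before/after:", max(abs(v) for v in vec), max(abs(v) for v in rotated))
print("norm before/after:   ", l2_norm(vec), l2_norm(rotated))
```

The rotation keeps the vector's length but spreads the outlier's energy across all channels, so the largest coordinate shrinks and a low-bit quantizer has a narrower range to cover. Whether TurboQuant's actual rotation behaves exactly like this is not established by the material above.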

What the newer thread adds is the implementation logic behind “2.5-bit.” The discussion says this comes from a mixed-precision split: as one commenter put it, “32 outlier channels are quantized at 3 bits” while “the remaining 96 channels use 2 bits,” yielding what the thread describes as an effective 2.5-bit average. That makes the claim more concrete for engineers: TurboQuant is using outlier-aware allocation, not some exotic fractional-bit datatype.
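The bitrate arithmetic is easy to check. The sketch below computes the average bits per channel for a list of `(channel_count, bits)` groups, ignoring any per-group scales or other metadata. One caveat worth noting: the split exactly as quoted (32 channels at 3 bits, 96 at 2 bits) averages to 2.25 bits, while an even 64/64 split would give 2.5. The `even_split` case is a hypothetical added for comparison, and nothing here explains the gap, so treat the numbers as illustrative rather than as a statement of what TurboQuant ships.

```python
def effective_bits(groups):
    """Average bits per channel for (channel_count, bits) groups, ignoring metadata overhead."""
    total_channels = sum(count for count, _ in groups)
    total_bits = sum(count * bits for count, bits in groups)
    return total_bits / total_channels


def packed_bytes(groups):
    """Bytes needed to store one vector's codes, rounded up to whole bytes."""
    total_bits = sum(count * bits for count, bits in groups)
    return (total_bits + 7) // 8


quoted_split = [(32, 3), (96, 2)]  # split as quoted in the thread
even_split = [(64, 3), (64, 2)]    # hypothetical split, for comparison only

print(effective_bits(quoted_split))  # 2.25
print(effective_bits(even_split))    # 2.5
print(packed_bytes(quoted_split))    # 36
print(packed_bytes(even_split))      # 40
```

For reference, the same 128 channels stored as 16-bit floats would take 256 bytes, so either split is roughly a 6x to 7x reduction before metadata is counted.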

Is this moving beyond a research claim?

The strongest signal here is early porting activity. The Hacker News thread highlights an “independent working implementation” in PyTorch and says someone is already implementing the method in llama.cpp, with both efforts linked as community work. That is still community-driven, but it means the technique is already being mapped onto common inference tooling.

A supporting X post from Wes Roth shows “Google Turbo Quant” running on “a standard MacBook Air M4 with just 16GB of RAM,” alongside a short terminal demo. It does not establish throughput, latency, or accuracy, but paired with the PyTorch and llama.cpp efforts it reinforces the near-term story: TurboQuant is being translated into practical memory-saving experiments rather than staying confined to a research blog.
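To put the memory-saving angle in rough numbers, here is a back-of-the-envelope KV-cache estimate. The formula (keys plus values, per layer, per KV head, per token) is standard for transformer inference, but the model shape and context length below are hypothetical and are not taken from the MacBook demo. The estimate also ignores quantization metadata such as scales.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes for keys plus values across all layers for one sequence, ignoring metadata."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys and values
    return values * bits_per_value / 8


GIB = 2 ** 30

# Hypothetical model shape and context length, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32768)

print(kv_cache_bytes(**shape, bits_per_value=16) / GIB)   # 4.0
print(kv_cache_bytes(**shape, bits_per_value=2.5) / GIB)  # 0.625
```

Under these assumptions the cache for a single 32K-token sequence drops from 4 GiB to well under 1 GiB, which is the kind of headroom that matters on a 16GB machine. Actual savings depend on the model, the context length, and per-block overhead that this sketch leaves out.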
