AI Primer
TOPIC · 42 stories

Inference Optimization

Techniques that improve cost, latency, throughput, or quality.

NEWS · 12th May
Perplexity benchmarks Qwen3 235B on GB200 NVL72: NVLS latency drops from 586 µs to 313 µs

Perplexity published serving results for post-trained Qwen3 235B on NVIDIA GB200 NVL72 and argues Blackwell materially outperforms Hopper for large MoE inference. The deltas show up in NVLS all-reduce latency, MoE prefill combine time, and high-speed decode throughput.

RELEASE · 12th May
Diffusers 0.38.0 adds Ace-Step 1.5 pipelines and Flash Attention 4 support

Hugging Face released Diffusers 0.38.0 with new audio and image pipelines, Flash Attention 4, FlashPack loading, and Ring Anything for context parallelism. Use the new profiling guidance to tune diffusion performance as you adopt the added model coverage.

RELEASE · 11th May
OpenBMB releases MiniCPM-V 4.6 1.3B with 75.7 ms TTFT and 19x token efficiency

OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8% lower vision-encoding FLOPs, 75.7 ms TTFT on a 4090, and about 1.5x token throughput over Qwen3.5 0.8B. It targets edge deployment across mobile platforms and common inference stacks.

RELEASE · 10th May
DFlash adds Qwen3-8B speculator with 82.2% first-token acceptance

Community posts report a DFlash speculator for Qwen3-8B with 82.2% first-token acceptance and 3.74 accepted tokens per step, alongside broader DFlash claims of over 6x lossless acceleration. The release matters because it turns a decoding paper into a concrete speculative-inference artifact engineers can test against existing Qwen stacks.
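
Acceptance statistics like these map directly onto a back-of-envelope throughput estimate. A rough cost model, assuming greedy verification and a hypothetical draft model running at 5% of target-model cost (the cost ratio is an assumption, not a DFlash figure):

```python
def spec_decode_speedup(accepted_per_step: float,
                        draft_cost_ratio: float) -> float:
    """Rough speculative-decoding speedup estimate.

    Each target-model verification step yields `accepted_per_step`
    draft tokens plus one token from the target's own forward pass.
    Draft overhead is approximated as proportional to the accepted
    token count.
    """
    tokens_per_step = accepted_per_step + 1.0
    # Cost per step, in units of target-model forward passes.
    cost = 1.0 + draft_cost_ratio * accepted_per_step
    return tokens_per_step / cost

# Reported 3.74 accepted tokens per step, hypothetical 5% draft cost:
print(round(spec_decode_speedup(3.74, 0.05), 2))  # 3.99
```

Under these assumptions the reported acceptance stats imply roughly 4x decode throughput; the headline "over 6x" claim depends on batching and kernel effects this sketch does not model.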

NEWS · 9th May
Gemma 4 MTP benchmarks 138 tok/s in llama.cpp on M5 Max

Community ports brought Gemma 4 multi-token prediction into llama.cpp and MLX Swift, with one M5 Max report moving from 97 to 138 tok/s and another showing 30-40% faster decoding. The gains extend MTP into local runtimes used for on-device coding and long-context work.

RELEASE · 1w ago
Gemma 4 adds MTP drafters for up to 3x faster decoding

Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.

RELEASE · 1w ago
Zyphra releases folded TSP with 173M tok/s on 1,024 MI300X GPUs

Zyphra published folded Tensor and Sequence Parallelism, claiming 173M tok/s versus 86M for matched TP+SP on 1,024 MI300X GPUs. The design keeps more replicas inside a node, reducing per-GPU memory pressure and cross-node communication.

RELEASE · 1w ago
Zyphra Inference launches MI355X endpoints for DeepSeek V3.2, Kimi K2.6, and GLM 5.1

Zyphra launched serverless inference on AMD MI355X for DeepSeek V3.2, Kimi K2.6, and GLM 5.1, aimed at long-horizon agent workloads. The service leans on high-HBM nodes to keep more long-context sessions resident and reduce queueing.

RELEASE · 1w ago
vLLM 0.20.1 fixes DeepSeek V4 TopK deadlocks and tool-call errors

The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.

RELEASE · 1w ago
Moondream releases Photon 1.2.0 with Apple Silicon, native Windows CUDA, and 23 ms B200 latency

Moondream shipped Photon 1.2.0, expanding its inference engine to Apple Silicon, Windows CUDA, Blackwell, and Jetson Thor, then outlined how custom Metal kernels and fused ops made local vision practical without MLX. That broadens deployment options for edge and on-device vision workloads while keeping server-class latency on B200 systems.

RELEASE · 2w ago
FlashQLA releases TileLang linear-attention kernels with 2–3x forward speedups

Alibaba Qwen introduced FlashQLA, a TileLang-based linear-attention kernel stack that reports 2–3x faster forward passes and 2x faster backward passes. The release gives edge and long-context deployments a new optimization lever below the model layer itself.

RELEASE · 2w ago
OpenAI adds WebSocket mode to Responses API for 40% faster Codex loops

OpenAI added WebSocket mode to the Responses API and says it cuts repeated work across Codex tool loops, improving end-to-end speed by up to 40%. The change reduces runtime overhead for agent workflows, not just base-model latency.

RELEASE · 2w ago
vLLM 0.20.0 releases TurboQuant 2-bit KV cache, CUDA 13 baseline, and DeepSeek V4 upgrades

vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release changes serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, including 4x KV capacity and a clearer upgrade path for teams already running V4.

NEWS · 2w ago
Qwen3.6 community ships MLX and 3-bit quants with 40-56 tok/s local agent runs

Builders published new MLX and 3-bit Qwen3.6 quants and shared reproducible local benchmarks from M3 Ultra, RTX 5070, and Radeon AI Pro setups. That gives local-agent teams concrete deployment options beyond launch-day claims, though memory budgets and long-context tool use still limit larger workflows.

NEWS · 2w ago
DeepSeek cuts V4-Pro API 75% to $0.43/$0.87 per 1M tokens through May 5

DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier deployment today, since Pro remains heavier and more rate-limited.

NEWS · 2w ago
SGLang supports DeepSeek V4 with 199 tok/s on B200 and 240 tok/s at 900K context

SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.

RELEASE · 2w ago
DeepSeek V4 reports CSA/HCA attention and 10% KV cache at 1M context

Engineers unpacked DeepSeek V4's hybrid CSA/HCA attention a day after launch; the design claims 27% of V3.2's FLOPs and 10% of its KV cache at 1M tokens. External tests pushed V4 Pro near the top of open-model indexes, but users also reported rate limits and mixed third-party results.

RELEASE · 2w ago
DeepSeek releases V4-Pro and V4-Flash with 1M context and $0.14/M input

DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.

RELEASE · 2w ago
DeepSeek releases Tile Kernels with Engram, mHC, and FP4/FP8 ops for SM90 and SM100 GPUs

DeepSeek published Tile Kernels, an open-source TileLang repo covering Engram, mHC, MoE routing, and FP4/FP8 kernels, with claims that some are already used in internal training and inference. That matters because it exposes reusable low-level performance work behind DeepSeek’s stack instead of keeping the kernels fully private.

NEWS · 3w ago
Google launches TPU 8t and TPU 8i with 3x pod compute and 1,152-chip inference pods

Google unveiled eighth-generation TPUs split into TPU 8t for training and TPU 8i for inference, saying 8t delivers nearly 3x per-pod compute over Ironwood while 8i links 1,152 chips in a pod. Google is tuning its hardware stack for larger training runs and lower-latency agent inference at cloud scale.

NEWS · 3w ago
Moonshot claims 1.54x throughput and 64% lower P90 TTFT with cross-datacenter prefill

Moonshot says its Prefill-as-a-Service setup makes prefill/decode disaggregation practical across datacenters and mixed hardware by shrinking KV cache with Kimi Linear. The paper reports 1.54x throughput and a 64% drop in P90 time-to-first-token, so benchmark the approach before planning production adoption.

WORKFLOW · 3w ago
Unsloth benchmarks Qwen3.6-35B-A3B GGUF quants at 20-40 tok/s on local rigs

Unsloth published GGUF quant benchmarks for Qwen3.6-35B-A3B while practitioners shared local setup guides and long-context agent runs on Apple silicon and high-RAM desktops. The sparse 35B model is becoming a credible local coding-agent option, but speed and reasoning quality still vary by quant and offload strategy.

NEWS · 4w ago
Parcae claims 1.3B Transformer quality from a 770M looped model

Together AI and UCSD released Parcae, a looped model that reuses layers under constrained recurrent dynamics and reports stronger results than parameter-matched Transformers from 140M to 1.3B scales. The released models and code suggest recurrence can trade compute for quality under fixed parameter budgets instead of scaling parameters alone.
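
The looped-model idea can be sketched as weight tying across depth: one block's parameters reused for several passes, trading extra compute for a smaller parameter footprint. A minimal NumPy sketch, illustrative only and not Parcae's released architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class LoopedBlock:
    """One feed-forward block whose weights are reused `n_loops`
    times, standing in for `n_loops` distinct layers. Parameter
    count stays at one layer; compute scales with the loop count."""
    def __init__(self, dim, n_loops):
        self.n_loops = n_loops
        self.w1 = rng.standard_normal((dim, 4 * dim)) * 0.02
        self.w2 = rng.standard_normal((4 * dim, dim)) * 0.02

    def __call__(self, x):
        for _ in range(self.n_loops):  # same weights on every pass
            h = np.maximum(layer_norm(x) @ self.w1, 0.0)  # ReLU MLP
            x = x + h @ self.w2                           # residual update
        return x

block = LoopedBlock(dim=64, n_loops=4)
out = block(rng.standard_normal((10, 64)))
print(out.shape)  # (10, 64)
```

A 4-loop block does four layers' worth of FLOPs with one layer's worth of weights, which is the trade the item describes.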

RELEASE · 4w ago
Hugging Face Hub launches Kernels with 1.7x-2.5x PyTorch speedups

Hugging Face introduced Kernels on the Hub to publish pre-compiled GPU kernels matched to GPU, PyTorch version, and OS. The packaging makes kernel optimizations shareable and claims 1.7x to 2.5x speedups over PyTorch baselines with torch.compile compatibility.

RELEASE · 1mo ago
Ollama adds MLX preview on Apple Silicon with reported 2.2x speedups

Ollama's Apple Silicon preview switches local inference to MLX, and users reportedly see sizable speedups with some Qwen3.5 variants on M-series Macs. Try it if you run local coding agents, since faster prefill and caching can cut session reload time.

NEWS · 1mo ago
TurboQuant updates 2.5-bit mixed precision with PyTorch and llama.cpp ports

New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.

NEWS · 1mo ago
TurboQuant cuts KV cache memory 6x with 3-bit storage

Google Research said TurboQuant can shrink KV cache storage to 3 bits with roughly 6x less memory, and early implementations already surfaced in PyTorch, llama.cpp, and Atomic Chat. The work targets a core inference bottleneck for long-context serving on local and server hardware.
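
The memory arithmetic behind KV-cache quantization is simple to check. A sketch with hypothetical model dimensions (not any specific model's config), ignoring the small overhead of quantization scales:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bits_per_value: int) -> int:
    """Bytes for keys plus values across all layers, one batch slot."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # K and V
    return values * bits_per_value // 8

# Hypothetical 32-layer model, 8 KV heads of dim 128, 128k context.
fp16 = kv_cache_bytes(32, 8, 128, 128_000, 16)
q3   = kv_cache_bytes(32, 8, 128, 128_000, 3)
print(round(fp16 / 2**30, 1), round(q3 / 2**30, 1))  # 15.6 2.9
```

Dropping storage from 16-bit to 3-bit is a 16/3 ≈ 5.3x reduction by itself, consistent in shape with the roughly 6x figure once per-tensor overheads and layout changes are accounted for.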

RELEASE · 1mo ago
Meta ships SAM 3.1 with object multiplexing for 16 tracked objects

SAM 3.1 is a drop-in update that shares video computation across up to 16 tracked objects instead of rerunning most of the model per object. Meta's H100 numbers show roughly 30 FPS at 16 objects versus under 10 FPS for SAM 3, which cuts multi-object video tracking cost.
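
The multiplexing win follows from a simple cost model: one shared backbone pass per frame plus a small per-object decode, versus a full pass per object. The millisecond costs below are hypothetical, chosen only to mirror the shape of the reported FPS numbers:

```python
def fps_shared(shared_ms: float, per_object_ms: float,
               n_objects: int) -> float:
    """Frames/sec when backbone cost is shared across tracked objects."""
    return 1000.0 / (shared_ms + per_object_ms * n_objects)

def fps_unshared(total_ms_per_object: float, n_objects: int) -> float:
    """Frames/sec when the full model reruns once per object."""
    return 1000.0 / (total_ms_per_object * n_objects)

# Hypothetical costs: 25 ms shared encoder + 0.5 ms per object,
# versus an 11 ms full pass per object.
print(round(fps_shared(25, 0.5, 16), 1))   # 30.3
print(round(fps_unshared(11, 16), 1))      # 5.7
```

Because the shared term dominates, FPS degrades slowly as objects are added, which is why the per-object rerun approach falls off a cliff at 16 objects while the multiplexed one does not.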

NEWS · 1mo ago
Google Research launches TurboQuant: 6x KV-cache compression, 8x faster H100 attention

TurboQuant claims 6x KV-cache memory reduction and up to 8x faster attention on H100s without retraining or quality loss on long-context tasks. If those results hold in serving stacks, teams should revisit long-context cost, capacity, and vector-search design.

NEWS · 1mo ago
Flash-MoE claims Qwen3.5-397B runs on iPhone at 0.6 tokens/sec via SSD streaming

Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.

RELEASE · 1mo ago
Miles adds ROCm support on AMD Instinct and raises AIME to 0.729

Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729. It matters if you are evaluating rollout-heavy RL jobs off NVIDIA and want concrete throughput and step-time numbers before porting.

RELEASE · 1mo ago
Flash-MoE claims 4.4 tokens/sec on Qwen3.5-397B on 48GB M3 Max

A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.
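
Mechanically, expert streaming is a small RAM cache of hot experts backed by memory-mapped weights on SSD. A minimal Python sketch; the file layout, shapes, and cache policy here are assumptions for illustration, not Flash-MoE's actual C/Metal implementation:

```python
import os
import tempfile
from collections import OrderedDict

import numpy as np

class ExpertCache:
    """LRU cache of MoE expert weights backed by a memory-mapped
    file on SSD. Only `capacity` experts stay resident in RAM; the
    rest are paged in on demand as the router selects them."""
    def __init__(self, path, n_experts, expert_shape, capacity):
        # One flat float16 file of expert matrices, mapped lazily.
        self.store = np.memmap(path, dtype=np.float16, mode="r",
                               shape=(n_experts, *expert_shape))
        self.capacity = capacity
        self.cache: OrderedDict[int, np.ndarray] = OrderedDict()

    def get(self, idx: int) -> np.ndarray:
        if idx in self.cache:
            self.cache.move_to_end(idx)      # mark as recently used
            return self.cache[idx]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)   # evict least recent
        w = np.array(self.store[idx])        # SSD read happens here
        self.cache[idx] = w
        return w

# Tiny demo store on disk: 4 experts of shape 2x2.
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
demo = np.memmap(path, dtype=np.float16, mode="w+", shape=(4, 2, 2))
demo[:] = np.arange(16, dtype=np.float16).reshape(4, 2, 2)
demo.flush()

cache = ExpertCache(path, 4, (2, 2), capacity=2)
cache.get(0); cache.get(1); cache.get(2)  # expert 0 gets evicted
print(sorted(cache.cache))  # [1, 2]
```

In a real engine the router's per-token expert indices drive `get`, so sustained throughput depends on how skewed expert usage is for the current prompt.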

RELEASE · 1mo ago
Mamba-3 updates its inference path with MIMO decode and new state updates

New write-ups on Mamba-3 add more detail on its MIMO decode path, discretization changes, and complex-valued state updates. That gives infra teams a clearer basis for testing state-space models as inference-efficient alternatives in long-sequence or agent-heavy systems.

RELEASE · 1mo ago
Morph launches FlashCompact: 33k tok/s compaction from 200k to 50k in 1.5s

Morph released FlashCompact, a specialized compaction model and SDK for coding agents, claiming 33k tokens per second and near-invisible long-context compression. Use it or copy the approach if compaction latency and noisy tool output are blocking longer agent runs.

RELEASE · 1mo ago
Hao AI Lab launches Dreamverse: 30s 1080p video in 4.5s on one GPU

Dreamverse paired Hao AI Lab's FastVideo stack with an interface for editing video scenes in a faster-than-playback loop, using quantization and fused kernels to keep latency below viewing time. The stack is interesting if you are building real-time multimodal generation or multi-user video serving.

RELEASE · 1mo ago
Unsloth releases Studio: local training UI for 500+ models with 70% less VRAM

Unsloth Studio launched as an open-source web UI to run, fine-tune, compare, and export local models, with file-to-dataset workflows and sandboxed code execution. Try it if you want to move prototype training and evaluation off cloud notebooks and onto local or rented boxes.

RELEASE · 1mo ago
Together releases Mamba-3 with MIMO decoding and the fastest prefill plus decode at 1.5B scale

Together introduced Mamba-3 and open-sourced kernels for a new MIMO state-space variant that targets decode efficiency and beats Mamba-2, GDN, and Llama 3.2 1B at 1.5B scale. Test it when deployment speed matters more than chasing another generic Transformer baseline.
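
The serving appeal of state-space decode is that per-token cost and memory stay constant: a fixed-size recurrent state replaces a KV cache that grows with context. A generic linear-SSM decode step, illustrative rather than Mamba-3's MIMO formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_state = 8, 16

# Fixed (untrained) SSM parameters, for illustration only.
A = np.eye(d_state) * 0.9                      # state decay
B = rng.standard_normal((d_state, d_in)) * 0.1
C = rng.standard_normal((d_in, d_state)) * 0.1

def decode_step(h, x):
    """One decode step: O(1) state update, no per-token cache growth."""
    h = A @ h + B @ x    # recurrent state update
    y = C @ h            # readout
    return h, y

h = np.zeros(d_state)
for _ in range(1000):    # 1,000 tokens later...
    h, y = decode_step(h, rng.standard_normal(d_in))
print(h.shape)  # (16,) -- state size is independent of sequence length
```

Compare that with attention decode, where each step appends a key and value per layer, so memory and attention cost grow linearly with the tokens generated so far.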

NEWS · 1mo ago
DistCA claims 1.35x long-context training gains with disaggregated core attention

Researchers released DistCA, a training system that offloads stateless core attention to dedicated servers and reports up to 1.35x throughput gains on long-context workloads. Evaluate it for very long-sequence training where attention imbalance strands GPUs and creates pipeline stalls.

NEWS · 1mo ago
Moonshot introduces Attention Residuals with 1.25x compute gains on Kimi Linear

Moonshot introduced Attention Residuals, replacing fixed depth-wise residual accumulation with learned lookbacks over earlier layers, and reports a 1.25x compute advantage on Kimi Linear. Try it as a drop-in lever for deeper stacks, but verify memory tradeoffs and downstream gains on your own architecture.
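
The idea can be sketched as replacing the fixed one-step residual x_{l+1} = x_l + f_l(x_l) with a learned weighting over all earlier layer outputs. A toy NumPy version with uniform lookback weights standing in for the learned ones; this is illustrative, not Moonshot's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_layers = 32, 6

def f(x, w):
    """Stand-in for a transformer block: a single nonlinear map."""
    return np.tanh(x @ w)

blocks = [rng.standard_normal((dim, dim)) * 0.1 for _ in range(n_layers)]
# Learned lookback weights would go here; uniform weights are a
# placeholder. lookback[l] mixes the outputs of layers 0..l.
lookback = [np.ones(l + 1) / (l + 1) for l in range(n_layers)]

x = rng.standard_normal(dim)
outputs = [x]  # the embedding counts as layer output 0
for l, w in enumerate(blocks):
    # Standard residual would use only outputs[-1]; here each layer
    # reads a weighted combination of every earlier output.
    mixed = sum(a * o for a, o in zip(lookback[l], outputs))
    outputs.append(mixed + f(mixed, w))

print(len(outputs), outputs[-1].shape)  # 7 (32,)
```

The memory tradeoff the item warns about is visible here: every layer's output must stay live for later lookbacks, instead of being overwritten by the running residual stream.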

NEWS · 1mo ago
FlashAttention-4 benchmarks 1,613 TFLOPS on B200, 1.3x over cuDNN 9.13

FlashAttention-4 targets Blackwell bottlenecks with redesigned pipelines, software-emulated exponential work, and lower shared-memory traffic, reaching up to 1,613 TFLOPS on B200. If you serve long-context models on B200 or GB200, benchmark it against your current cuDNN and Triton kernels before optimizing elsewhere.

RELEASE · 2mo ago
FastVideo claims 5-second 1080p generation in 4.55s on one GPU

FastVideo published an LTX-2.3 inference stack that claims 5-second 1080p text-image-to-audio-video generation in 4.55 seconds on a single GPU. If the results hold up, test it for lower-cost interactive video generation and faster iteration loops.

NEWS · 2mo ago
Google LiteRT-LM PR adds Gemma4 NPU support ahead of an expected release

A Google bot-authored LiteRT-LM pull request references Gemma4 and AIcore NPU support, while multiple posts claim the largest version has around 120B total and 15B active parameters. Engineers targeting on-device inference should wait for a formal model card before locking plans.

AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.