Inference Optimization
Techniques that improve cost, latency, throughput, or quality.
Stories
Independent measurements after DSpark put DeepSeek V4-Pro around 90 tok/s and cut one run from 214s to 116s. The gain matters because it lowers serving cost, though tuning details and memory overhead are still unclear.
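As a rough aid for reading the run-time figures in that story, the short sketch below turns the reported 214s and 116s into a speedup factor and a fraction of wall-clock time saved. The function and variable names are illustrative only, and the calculation assumes both runs covered the same workload, which the story does not confirm.

```python
# Run times quoted in the story: one run went from 214 s to 116 s.
baseline_seconds = 214.0
optimized_seconds = 116.0


def speedup(before: float, after: float) -> float:
    """Return how many times faster the 'after' run is than the 'before' run."""
    return before / after


def time_saved_fraction(before: float, after: float) -> float:
    """Return the fraction of wall-clock time removed by the optimization."""
    return (before - after) / before


run_speedup = speedup(baseline_seconds, optimized_seconds)
run_saving = time_saved_fraction(baseline_seconds, optimized_seconds)

# Prints: speedup: 1.84x, wall-clock time saved: 45.8%
print(f"speedup: {run_speedup:.2f}x, wall-clock time saved: {run_saving:.1%}")
```

A speedup measured on a single run says little about variance, so the same arithmetic is better applied to several runs before drawing cost conclusions.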
DeepSeek open-sourced DeepSpec, a codebase for training and evaluating draft models for speculative decoding, alongside the DSpark decoding module for V4 checkpoints. It matters because inference teams get a new open stack for improving draft-model quality and decode throughput beyond earlier MTP-style baselines.
Perceptron launched a video_frames input for Mk1 that accepts pre-decoded frames with timestamps instead of forcing clip re-encoding. The change matters for edge and sparse-footage pipelines because 10 minutes of 1080p video can start returning tokens roughly ten times faster.
Vercel and Wafer launched a serverless GLM-5.2 endpoint on AI Gateway with 1M context and published pricing. Teams get a high-throughput open-model option inside an existing gateway instead of managing GLM inference directly.
Morph said its code-serving stack now exposes Qwen, GLM-5.2, MiniMax M3, and DeepSeek v4 with code-tuned speculative decoding. It claims 20-35% higher acceptance than Eagle 3.1 or DFlash, plus kernels for cheaper hardware.
Wafer said its GLM-5.2 deployment leads Artificial Analysis on throughput and latency, and priced usage at $1.20 input and $4.10 output per million tokens. Compare serverless and dedicated endpoints if you need speed at scale.
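To make the per-million-token prices in that story concrete, here is a minimal cost sketch. The two rates come from the story; the function name and the example request size of 200,000 input and 50,000 output tokens are assumptions chosen for illustration, and the sketch ignores any caching discounts or minimum charges a provider may apply.

```python
# Prices quoted in the story, in US dollars per one million tokens.
INPUT_USD_PER_MILLION = 1.20
OUTPUT_USD_PER_MILLION = 4.10


def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its input and output token counts."""
    total = (
        input_tokens * INPUT_USD_PER_MILLION
        + output_tokens * OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


# Hypothetical workload: a long prompt with a moderately long completion.
example_cost = request_cost_usd(input_tokens=200_000, output_tokens=50_000)
print(f"estimated cost: ${example_cost:.3f}")
```

Because output tokens cost more than three times as much as input tokens at these rates, workloads with long completions are affected most by the output price.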
LMSYS and Modal shipped DFlash plus Spec V2 in SGLang, claiming 4.3x the throughput of the baseline and 1.5x that of native MTP on Qwen3.5-397B-A17B. The combination is pitched as cutting latency and serving cost for very large open models.
Together AI said its DeepSeek V4 Pro deployment now leads Artificial Analysis on both output speed and latency. The claim matters because it turns V4 serving into an inference-systems story about KV cache reuse, prefix reuse, kernels, and endpoint profiles rather than model weights alone.
Cohere added MLX support, Unsloth GGUFs, oMLX work, and updated docs for North Mini Code two days after launch, with llama.cpp still under review. The broader runtime coverage makes the 30B coding model easier to run on local Mac, quantized, and self-hosted stacks.
Google released Apache 2.0 DiffusionGemma, a 26B-A4B diffusion text model that claims up to 4x faster output by generating text in blocks instead of one token at a time. The release matters for local and hosted stacks that want to test a new decoding path.
Google's new diffusion text model picked up same-day runtime support: vLLM added native diffusion-LM serving, Unsloth shipped GGUFs, and llama.cpp got local setup guidance. That shortens the path from release to local and hosted evaluation.
Apple said its most powerful on-device model runs on iPhone 17 Pro, while independent analysis describes a 20B design that routes a query to experts loaded from NAND into RAM. The architecture matters because it trades dense inference for hardware-aware expert selection, but access is constrained by device and region limits.
A local benchmark on a 128GB Framework system reported Qwen3-TTS performance close to an M5 Max using a GGML Vulkan backend. The result suggests AMD Strix hardware can approach Apple-class local TTS speed without MLX or Metal.
Google published Gemma 4 QAT checkpoints and mobile-focused quant formats, cutting Gemma 4 E2B to roughly 1GB of memory. Ollama, SGLang, and vLLM added day-one support, making local deployment more practical on phones, laptops, and low-VRAM GPUs.
Google released Gemma 4 12B, an Apache 2.0 encoder-free multimodal model with native audio and vision for 16GB-class laptops. Day-zero support in llama.cpp, vLLM, Ollama, MLX, and SGLang should make local agents and on-device apps easier to deploy immediately.
Perplexity said Computer will split tasks between on-device models and frontier cloud models, keeping some data on the local machine while escalating harder work remotely. That matters for privacy-sensitive workflows and for reducing token-heavy cloud usage on laptop-class hardware.
NVIDIA teased Nemotron 3 Ultra as a 550B open-weight model due later this week, with early messaging centered on 5x faster and 30% cheaper inference plus a hybrid SSM-MoE design. The rollout matters because early benchmark posts already place it near the top of open-weight leaderboards, widening NVIDIA’s open-model push beyond Cosmos.
MiniMax shipped M3 with a 1M-token context window, native multimodal input, and frontier coding claims across SWE-Bench Pro, Terminal Bench, and MCP Atlas. It also appeared on OpenRouter, Ollama Cloud, Venice, Hermes, Cline, Together, and Arena on day one.
vLLM 0.22.0 shipped DeepSeek V4 hardening, a Rust frontend, batch-invariant Cutlass FP8 paths, and multi-tier KV cache offloading. The release also removes deprecated APIs, so some serving stacks will need upgrade work.
Perplexity open-sourced the XLM-RoBERTa Unigram tokenizer it rebuilt for ranking and retrieval, reporting 5-6x lower CPU use and 63 microsecond p50 at 514 tokens. Teams running fast rerankers and embedders should watch tokenization cost as a latency bottleneck.
Alibaba rolled out implicit caching for Qwen3.7 Max, automatically reusing repeated context without user setup. The update also lands with fresh benchmark results and broader coding-agent support across OpenCode and Hermes Agent.
MiniMax started winding down its M2 series while previewing M3 and a new sparse-attention design with large long-context speedup claims. The teaser points to a fresh open-model race around block selection, GQA, and million-token serving efficiency.
Huawei outlined a τ scaling framework and LogicFolding design that shifts chip progress from node shrinkage toward shorter signal delay. The proposal matters because it targets performance, density, and yield gains without relying only on EUV-era process shrinks.
SGLang v0.5.12 added native DeepSeek V4 support with ShadowRadix prefix caching, HiSparse CPU-extended KV, MegaMoE kernels, and Blackwell MLA work. The release broadens hardware targets and improves long-context serving efficiency for open runtimes.
Nous Research published Lighthouse Attention, a hierarchical selection layer that keeps the standard attention kernel while cutting end-to-end pretraining wall clock by 1.4-1.7x at 98K context. It also scales to 1M-token training across 32 Blackwell GPUs without a custom sparse kernel.
Unsloth said its updated Qwen3.5 MTP GGUFs now run about 1.8x faster after llama.cpp added spec-draft-p-min 0.75 and renamed the mode to draft-mtp. The update also raises draft-token settings and expands the small-model MTP set for local runners.
Zyphra released ZAYA1-8B-Diffusion-Preview, its first diffusion language model trained on AMD, and said 16-token block generation delivers 4.6x-7.7x faster decoding with limited quality loss. The design targets autoregressive KV-cache bottlenecks while keeping post-training and test-time compute viable.
Perplexity published serving results for post-trained Qwen3 235B on NVIDIA GB200 NVL72 and argues Blackwell materially outperforms Hopper for large MoE inference. The deltas show up in NVLS all-reduce latency, MoE prefill combine time, and high-speed decode throughput.
Hugging Face released Diffusers 0.38.0 with new audio and image pipelines, Flash Attention 4, FlashPack loading, and Ring Anything for context parallelism. Use the new profiling guidance to tune diffusion performance as you adopt the added model coverage.
OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8 percent lower vision-encoding FLOPs, 75.7 ms TTFT on a 4090, and about 1.5x token throughput over Qwen3.5 0.8B. It targets edge deployment across mobile platforms and common inference stacks.
Posts said Qwen3-8B now has a DFlash speculator with 82.2% first-token acceptance and 3.74 accepted tokens per step, alongside broader DFlash claims of over 6x lossless acceleration. It matters because the release turns a decoding paper into a concrete speculative-inference artifact engineers can test against existing Qwen stacks.
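For readers weighing acceptance figures like these, the sketch below shows the textbook relationship between a draft model's acceptance rate and the expected number of tokens produced per verification step in speculative decoding. It is a simplified model, not DFlash's method: it assumes every drafted token is accepted independently with the same probability and counts the one extra token the target model emits each step. Real acceptance usually falls at later draft positions, so the story's 82.2% first-token rate and 3.74 tokens per step cannot be derived from each other this way. The function name and the example values (an acceptance rate of 0.8, draft lengths 1 to 8) are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_length: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Simplifying assumption: each of the `draft_length` drafted tokens is
    accepted independently with probability `acceptance_rate`, drafting stops
    at the first rejection, and the target model always contributes one token.
    Under that assumption the expectation is the geometric sum
    1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    if acceptance_rate == 1.0:
        # Every drafted token is accepted, plus the target model's own token.
        return float(draft_length + 1)
    return (1.0 - acceptance_rate ** (draft_length + 1)) / (1.0 - acceptance_rate)


# Hypothetical sweep: how draft length changes the yield at a fixed rate.
for k in (1, 2, 4, 8):
    print(k, round(expected_tokens_per_step(0.8, k), 3))
```

The sweep illustrates diminishing returns: under this model, longer drafts add less and less per step, which is one reason measured tokens per step is a more useful number than a single acceptance rate.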
Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.
Zyphra published folded Tensor and Sequence Parallelism, claiming 173M tok/s versus 86M for matched TP+SP on 1,024 MI300X GPUs. The design keeps more replicas inside a node, reducing per-GPU memory pressure and cross-node communication.
Zyphra launched serverless inference on AMD MI355X for DeepSeek V3.2, Kimi K2.6, and GLM 5.1, aimed at long-horizon agent workloads. The service leans on high-HBM nodes to keep more long-context sessions resident and reduce queueing.
The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.
Moondream shipped Photon 1.2.0, expanding its inference engine to Apple Silicon, Windows CUDA, Blackwell, and Jetson Thor, then outlined how custom Metal kernels and fused ops made local vision practical without MLX. That broadens deployment options for edge and on-device vision workloads while keeping server-class latency on B200 systems.
OpenAI added WebSocket mode to the Responses API and says it cuts repeated work across Codex tool loops, improving end-to-end speed by up to 40%. The change reduces runtime overhead for agent workflows, not just base-model latency.
Alibaba Qwen introduced FlashQLA, a TileLang-based linear-attention kernel stack that reports 2–3x faster forward passes and 2x faster backward passes. The release gives edge and long-context deployments a new optimization lever below the model layer itself.
vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release changes serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, including 4x KV capacity and a clearer upgrade path for teams already running V4.
Builders published new MLX and 3-bit Qwen3.6 quants and shared reproducible local benchmarks from M3 Ultra, RTX 5070, and Radeon AI Pro setups. That gives local-agent teams concrete deployment options beyond launch-day claims, though memory budgets and long-context tool use still limit larger workflows.
DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier deploy today, while Pro stays heavier and more rate-limited.
SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.
Engineers unpacked DeepSeek V4's hybrid CSA/HCA attention a day after launch; it claims 27% of V3.2 FLOPs and 10% of its KV cache at 1M tokens. External tests pushed V4 Pro near the top of open-model indexes, but users also reported rate limits and mixed third-party results.
DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.
DeepSeek published Tile Kernels, an open-source TileLang repo covering Engram, mHC, MoE routing, and FP4/FP8 kernels, with claims that some are already used in internal training and inference. That matters because it exposes reusable low-level performance work behind DeepSeek’s stack instead of keeping the kernels fully private.
Google unveiled eighth-generation TPUs split into TPU 8t for training and TPU 8i for inference, saying 8t delivers nearly 3x per-pod compute over Ironwood while 8i links 1,152 chips in a pod. Google is tuning its hardware stack for larger training runs and lower-latency agent inference at cloud scale.
Moonshot says its Prefill-as-a-Service setup makes prefill/decode disaggregation practical across datacenters and mixed hardware by shrinking KV cache with Kimi Linear. The paper reports 1.54x throughput and a 64% drop in P90 time-to-first-token, so benchmark the approach before planning production adoption.
Unsloth published GGUF quant benchmarks for Qwen3.6-35B-A3B while practitioners shared local setup guides and long-context agent runs on Apple silicon and high-RAM desktops. The sparse 35B model is becoming a credible local coding-agent option, but speed and reasoning quality still vary by quant and offload strategy.
Together AI and UCSD released Parcae, a looped model that reuses layers with a constrained recurrent dynamic and reports stronger results than parameter-matched Transformers from 140M to 1.3B scales. The released models and code suggest recurrence can trade memory for quality under fixed FLOP budgets instead of scaling parameters alone.
Hugging Face introduced Kernels on the Hub to publish pre-compiled GPU kernels matched to GPU, PyTorch version, and OS. The packaging makes kernel optimizations shareable and claims 1.7x to 2.5x speedups over PyTorch baselines with torch.compile compatibility.