AI Primer
TOPIC · 50 stories

LLM Serving

Serving stacks and runtime systems for model inference.

NEWS · 12th May
Perplexity benchmarks Qwen3 235B on GB200 NVL72: NVLS latency drops from 586 µs to 313 µs

Perplexity published serving results for post-trained Qwen3 235B on NVIDIA GB200 NVL72 and argues Blackwell materially outperforms Hopper for large MoE inference. The deltas show up in NVLS all-reduce latency, MoE prefill combine time, and high-speed decode throughput.

NEWS · 10th May
Local users report DeepSeek V4 Flash, Qwen 3.6, and Gemma 4 at 40-200 tok/s on Macs and 3090s

Developers posted new local-model measurements for DeepSeek V4 Flash, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.

NEWS · 9th May
Gemma 4 MTP benchmarks 138 tok/s in llama.cpp on M5 Max

Community ports brought Gemma 4 multi-token prediction into llama.cpp and MLX Swift, with one M5 Max report moving from 97 to 138 tok/s and another showing 30-40% faster decoding. The gains extend MTP into local runtimes used for on-device coding and long-context work.

RELEASE · 1w ago
Gemma 4 adds MTP drafters for up to 3x faster decoding

Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.

RELEASE · 1w ago
Zyphra releases folded TSP with 173M tok/s on 1,024 MI300X GPUs

Zyphra published folded Tensor and Sequence Parallelism, claiming 173M tok/s versus 86M for matched TP+SP on 1,024 MI300X GPUs. The design keeps more replicas inside a node, reducing per-GPU memory pressure and cross-node communication.

RELEASE · 1w ago
Zyphra Inference launches MI355X endpoints for DeepSeek V3.2, Kimi K2.6, and GLM 5.1

Zyphra launched serverless inference on AMD MI355X for DeepSeek V3.2, Kimi K2.6, and GLM 5.1, aimed at long-horizon agent workloads. The service leans on high-HBM nodes to keep more long-context sessions resident and reduce queueing.

RELEASE · 1w ago
vLLM 0.20.1 fixes DeepSeek V4 TopK deadlocks and tool-call errors

The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.

NEWS · 1w ago
Developers report DeepSeek V4 Flash handles 32M-token coding runs for $0.25

Users reported moving long coding sessions from Claude to DeepSeek V4 Flash and seeing tens of millions of tokens cost only cents. Hacker News discussion also leaned toward Flash over Pro for day-to-day use, so teams should test whether the low published prices hold in their own workflows.

RELEASE · 2w ago
IBM releases Granite 4.1 30B/8B/3B open models under Apache 2.0

IBM released Granite 4.1 as three open instruct models, with third parties quickly surfacing token-efficiency results and new deployment paths. The update matters for teams evaluating smaller open models for agent workloads where output-token burn and openness both affect production cost.

RELEASE · 2w ago
Nemotron 3 Nano Omni launches 30B-A3B multimodal model with 256K context

NVIDIA opened Nemotron 3 Nano Omni, a 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching separate components.

RELEASE · 2w ago
Poolside releases Laguna M.1 and XS.2 coding models with 225B/23B and 33B/3B MoEs

Poolside opened Laguna M.1 and Laguna XS.2 as its first public coding models, with Apache 2.0 weights and same-day provider support. That gives teams open coding models that can run locally or through standard serving stacks.

NEWS · 2w ago
Bedrock adds OpenAI models and stateful runtime in coming weeks

AWS says OpenAI models will land on Bedrock in coming weeks alongside a new stateful runtime. OpenAI also said its Microsoft partnership is now non-exclusive, which opens a multi-cloud path for deployment and procurement.

RELEASE · 2w ago
OpenClaw 2026.4.26 adds Google Live Talk, openclaw migrate, and Matrix E2EE

OpenClaw 2026.4.26 shipped Google Live Talk, local-model fixes, openclaw migrate imports for Claude and Hermes, and one-command Matrix E2EE. It also hardens plugins, Docker, and transcript compaction for self-hosted agent runs.

RELEASE · 2w ago
MiMo-V2.5 opens under MIT with 1M context and SGLang vLLM support

Xiaomi opened MiMo-V2.5 and MiMo-V2.5-Pro under MIT, adding a 1M-context multimodal agent model and a 42B-active Pro variant. SGLang and vLLM published day-one recipes, making the series immediately deployable.

RELEASE · 2w ago
vLLM 0.20.0 releases TurboQuant 2-bit KV cache, CUDA 13 baseline, and DeepSeek V4 upgrades

vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release changes serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, including 4x KV capacity and a clearer upgrade path for teams already running V4.

NEWS · 2w ago
DeepSeek cuts input cache-hit price 90% to $0.003625 per 1M tokens

DeepSeek said cache-hit pricing across its API series is now one-tenth of launch levels, on top of the temporary V4-Pro discount through May 5. The cut lowers costs for cache-heavy long-context and agent workloads, so teams should recheck spend assumptions.
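The effect of a cache-price cut like this is easiest to see with a blended-cost calculation. A minimal sketch: the cache-hit rate here is hypothetical, and the $0.14/M miss price is borrowed from the V4-Flash launch pricing reported elsewhere on this page, not from this announcement.

```python
CACHE_HIT_PER_M = 0.003625  # USD per 1M input tokens on a cache hit (per the announcement)

def input_cost(tokens: int, hit_rate: float, miss_per_m: float) -> float:
    """Blended input cost in USD. `miss_per_m` is workload-specific;
    the announcement only covers the cache-hit rate."""
    millions = tokens / 1_000_000
    return millions * (hit_rate * CACHE_HIT_PER_M + (1 - hit_rate) * miss_per_m)

# Example: a 32M-token agent run at 90% cache hits, assuming $0.14/M on misses.
print(round(input_cost(32_000_000, 0.90, 0.14), 4))  # → 0.5524
```

At high hit rates the miss price dominates total spend, which is why cache-heavy agent loops benefit most from this kind of cut.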

NEWS · 2w ago
Qwen3.6 community ships MLX and 3-bit quants with 40-56 tok/s local agent runs

Builders published new MLX and 3-bit Qwen3.6 quants and shared reproducible local benchmarks from M3 Ultra, RTX 5070, and Radeon AI Pro setups. That gives local-agent teams concrete deployment options beyond launch-day claims, though memory budgets and long-context tool use still limit larger workflows.

NEWS · 2w ago
DeepSeek cuts V4-Pro API 75% to $0.43/$0.87 per 1M tokens through May 5

DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier deployment today, since Pro remains heavier and more rate-limited.

NEWS · 2w ago
SGLang supports DeepSeek V4 with 199 tok/s on B200 and 240 tok/s at 900K context

SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.

RELEASE · 2w ago
DeepSeek V4 reports CSA/HCA attention and 10% KV cache at 1M context

Engineers unpacked DeepSeek V4's hybrid CSA/HCA attention a day after launch; it claims 27% of V3.2 FLOPs and 10% of its KV cache at 1M tokens. External tests pushed V4 Pro near the top of open-model indexes, but users also reported rate limits and mixed third-party results.

NEWS · 2w ago
DeepSeek V4 adds day-1 support from vLLM, SGLang, Ollama, OpenCode, Venice, and Together

Within a day of launch, vLLM, SGLang, Ollama cloud, OpenCode, Venice, Together, and Baseten added support or hosted access for DeepSeek V4. That makes Flash and Pro easier to test across local, routed, and managed agent stacks.

RELEASE · 3w ago
DeepSeek releases V4-Pro and V4-Flash with 1M context and $0.14/M input

DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.

RELEASE · 3w ago
Tencent launches Hy3 preview with 295B/21B, 256K context, and day-one OpenRouter, vLLM, and SGLang support

Tencent open-sourced Hy3 preview, a 295B MoE with 21B active parameters and 256K context, then pushed it into OpenRouter, OpenCode, OpenClaw, vLLM, and SGLang immediately. That matters because engineers can test and deploy a new reasoning-agent model on day one instead of waiting for the runtime ecosystem to catch up.

RELEASE · 3w ago
DeepSeek releases Tile Kernels with Engram, mHC, and FP4/FP8 ops for SM90 and SM100 GPUs

DeepSeek published Tile Kernels, an open-source TileLang repo covering Engram, mHC, MoE routing, and FP4/FP8 kernels, with claims that some are already used in internal training and inference. That matters because it exposes reusable low-level performance work behind DeepSeek’s stack instead of keeping the kernels fully private.

RELEASE · 3w ago
Qwen3.6-27B releases with 77.2 SWE-Bench Verified and Apache 2.0

Alibaba released Qwen3.6-27B, a dense open model with multimodal input and thinking or non-thinking modes that beats Qwen3.5-397B-A17B across major coding benchmarks. Day-one support across vLLM, SGLang, Ollama, llama.cpp, GGUF, and MLX makes it ready for local and hosted coding agents.

NEWS · 3w ago
Kimi K2.6 adds free Hermes and Cline access plus Replicate, Perplexity, and Together support

A day after Kimi K2.6’s launch, providers and tools opened new access paths including temporary free use in Hermes and Cline plus availability on Replicate, Together, Perplexity, and Tinker. Engineers can test the open model across agent harnesses and hosted runtimes without standing up their own stack first.

RELEASE · 3w ago
Xiaomi MiMo-V2.5-Pro releases with 57.2 SWE-Bench Pro, 1M context, and OpenRouter access

Xiaomi’s MiMo-V2.5-Pro and MiMo-V2.5 arrived with million-token context windows, stronger coding and agentic claims, and immediate access through OpenRouter plus agent harnesses. The rollout adds another low-cost Chinese frontier model that engineers can route into coding workflows without waiting for a proprietary IDE deal.

RELEASE · 3w ago
Kimi K2.6 launches API with $0.95/M input, 256K context, and video input

Moonshot put Kimi K2.6 on API with cache-hit/cache-miss pricing, tool calls, JSON modes, and native text-image-video input. It also open-sourced FlashKDA and landed in Warp, Cosine, Genspark, and OpenClaw, making the launch usable coding-agent infrastructure.

NEWS · 3w ago
Kimi K2.6 adds day-one support across vLLM, SGLang, Ollama, and OpenRouter

Kimi K2.6 shipped across vLLM, SGLang, OpenRouter, Baseten, Ollama, OpenCode, Hermes Agent, and Droid within hours of launch. That cuts the usual lag between model release and production trials, so mixed-provider agent stacks can test it sooner.

NEWS · 3w ago
Gemma 4 ecosystem ships 60+ on-device demos and local agent benchmarks

A weekend of Gemma 4 demos spanned YC hackathon projects, offline iPhone runs, and HN reports of strong local coding and SQL-agent performance. Gemma 4 is increasingly showing up as a practical edge model for tool use and multimodal apps, not just a release benchmark.

NEWS · 3w ago
Qwen3.6-35B-A3B benchmarks 40 tok/s on M3 Ultra with Strix Halo follow-ups

Fresh local reports put Qwen3.6-35B-A3B around 40 tok/s on M3 Ultra, extended testing to Strix Halo, and wired it into OpenClaw and Pi-style harnesses. The update matters because Qwen3.6 is moving from quant benchmarks into real local coding-agent loops with clearer hardware limits.

NEWS · 3w ago
Moonshot claims 1.54x throughput and 64% lower P90 TTFT with cross-datacenter prefill

Moonshot says its Prefill-as-a-Service setup makes prefill/decode disaggregation practical across datacenters and mixed hardware by shrinking KV cache with Kimi Linear. The paper reports 1.54x throughput and a 64% drop in P90 time-to-first-token, so benchmark the approach before planning production adoption.

RELEASE · 4w ago
Qwen3.6-35B-A3B releases Apache 2.0 sparse MoE with 3B active params

Alibaba open-sourced Qwen3.6-35B-A3B, a 35B multimodal sparse MoE with only 3B active parameters under Apache 2.0. Same-day support from vLLM, Ollama, SGLang, and GGUF builders makes it immediately usable for local and production coding workloads.

RELEASE · 4w ago
MiniMax M2.7 supports 128 GB GGUF runs and day-0 cloud hosting

MiniMax M2.7 moved from announcement to deployment, with GGUF guidance for 128 GB local systems and same-day availability on Together, Fireworks, Hugging Face, and ModelScope. Use the local and managed serving options now, but check the non-commercial license before adopting the 230B model.

RELEASE · 4w ago
MiniMax releases M2.7 open model with 56.22% SWE-Pro and 57.0% Terminal Bench 2

MiniMax open-sourced M2.7 and published coding and agent benchmark claims including 56.22% SWE-Pro and 57.0% Terminal Bench 2. Day-zero support from SGLang, vLLM, Ollama Cloud, Together AI, and NVIDIA NIM makes it easy to try on common serving stacks.

NEWS · 1mo ago
GLM-5.1 lands on Modal, Together AI, Letta Code, and Tembo

Providers and agent platforms added GLM-5.1 endpoints across Modal, Together AI, Letta Code, Tembo, and Tabbit, with free trials, no-key access, and 99.9% SLA options. Use the new hosting options to test the model for coding and long-horizon agent workloads without waiting on self-hosting.

NEWS · 1mo ago
Gemma 4 26B-A4B benchmarks at ~40 tokens/s in Mac code-agent tests

HN practitioners report Gemma 4 26B-A4B near 40 tokens per second in code-agent harnesses on Mac-class hardware, and Unsloth published a free Colab fine-tuning flow. Use the local benchmark as a practical reference and the Colab path if you want task-specific tuning without added cost.

WORKFLOW · 1mo ago
Gemma 4 26B-A4B runs at 30K context on 16 GB VRAM in community configs

Users published reproducible 16 GB VRAM and Apple Silicon setups for the Gemma 4 26B-A4B and 31B variants. Google’s AI Gallery app also brought offline Gemma chat to phones. The setups make local coding and vision work more practical, but runtime choice, quantization, and recent llama.cpp regressions still affect reliability.
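An invocation in the spirit of those community configs might look like the following. This is a hypothetical sketch: the model filename is a placeholder, and exact flag behavior depends on your llama.cpp build and quantization.

```shell
# Hypothetical example: serve a quantized Gemma-class GGUF at ~30K context
# on a single 16 GB GPU via llama.cpp's server.
llama-server \
  -m ./gemma-4-26b-a4b-Q4_K_M.gguf \  # placeholder model path
  -c 30720 \                          # context window (~30K tokens)
  -ngl 99 \                           # offload all layers to the GPU
  --host 127.0.0.1 --port 8080
```

Whether 30K context fits in 16 GB depends heavily on the quant level and KV-cache settings, so treat the community configs as the ground truth for your hardware.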

RELEASE · 1mo ago
Ollama adds MLX preview on Apple Silicon with reported 2.2x speedups

Ollama's Apple Silicon preview switches local inference to MLX, and users report speedups of up to 2.2x with some Qwen3.5 variants on M-series Macs. Try it if you run local coding agents, since faster prefill and caching can cut session reload time.

NEWS · 1mo ago
TurboQuant updates 2.5-bit mixed precision with PyTorch and llama.cpp ports

New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.

RELEASE · 1mo ago
Arm launches AGI CPU with 136 Neoverse V3 cores and 272-core blade

Arm introduced its first production server chip under its own banner, with up to 136 Neoverse V3 cores and a 272-core dual-node reference blade. The launch pushes Arm deeper into direct datacenter silicon for agentic AI workloads, not just IP licensing.

RELEASE · 1mo ago
tinybox ships red v2 with 4x 9070 XT and 64 GB GPU RAM for $12,000

tiny corp is shipping tinybox red v2 at $12,000 with four 9070 XT GPUs and 64 GB of GPU memory, alongside higher-end Blackwell systems. Buyers are weighing the bundled tinygrad stack against DIY rigs, model-fit limits, and cloud economics.

NEWS · 1mo ago
LiteLLM 1.82.7 and 1.82.8 ship malicious .pth credential stealer on PyPI

Compromised LiteLLM 1.82.7 and 1.82.8 wheels executed a malicious .pth file at install time to exfiltrate credentials, and PyPI quarantined the releases. Treat fresh-package installs and AI infra dependencies as supply-chain risk, and check startup hooks on affected systems.

NEWS · 1mo ago
TurboQuant cuts KV cache memory 6x with 3-bit storage

Google Research said TurboQuant can shrink KV cache storage to 3 bits with roughly 6x less memory, and early implementations already surfaced in PyTorch, llama.cpp, and Atomic Chat. The work targets a core inference bottleneck for long-context serving on local and server hardware.
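The general mechanism behind low-bit KV-cache storage can be sketched with per-channel affine quantization. This is an illustrative sketch of the technique class, not TurboQuant's actual algorithm; note that raw fp16-to-3-bit packing gives roughly a 5.3x ratio before counting the per-channel scales.

```python
import numpy as np

def quantize_kv(block: np.ndarray, bits: int = 3):
    """Per-channel affine quantization of a KV-cache block to `bits` bits.
    Returns integer codes plus the scale/offset needed to reconstruct."""
    levels = 2 ** bits - 1                       # 7 levels for 3-bit codes
    lo = block.min(axis=0, keepdims=True)
    hi = block.max(axis=0, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.round((block - lo) / scale).astype(np.uint8)  # codes fit in 0..7
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    return q.astype(np.float32) * scale + lo

kv = np.random.randn(128, 64).astype(np.float32)  # (tokens, head_dim)
q, scale, lo = quantize_kv(kv, bits=3)
recon = dequantize_kv(q, scale, lo)
# Worst-case per-element error is half a quantization step (scale / 2).
```

Real systems pack the 3-bit codes tightly and tune the grouping (per-channel, per-token, or per-block) to trade accuracy against the overhead of storing scales.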

RELEASE · 1mo ago
Cohere launches Transcribe 03-2026 with 14 languages and Apache 2.0 weights

Cohere released a 2B speech-to-text model with 14 languages and top Open ASR scores, and upstreamed encoder-decoder optimizations to vLLM in the same launch. It is a self-hosted ASR option, so test accuracy and throughput on your own speech workload.

NEWS · 1mo ago
Google Research launches TurboQuant: 6x KV-cache compression, 8x faster H100 attention

TurboQuant claims 6x KV-cache memory reduction and up to 8x faster attention on H100s without retraining or quality loss on long-context tasks. If those results hold in serving stacks, teams should revisit long-context cost, capacity, and vector-search design.

NEWS · 1mo ago
Flash-MoE claims Qwen3.5-397B runs on iPhone at 0.6 tokens/sec via SSD streaming

Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.

RELEASE · 1mo ago
Miles adds ROCm support on AMD Instinct and raises AIME to 0.729

Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729. It matters if you are evaluating rollout-heavy RL jobs off NVIDIA and want concrete throughput and step-time numbers before porting.

RELEASE · 1mo ago
Flash-MoE claims 4.4 tokens/sec running Qwen3.5-397B on a 48GB M3 Max

A pure C and Metal engine streams 209GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.
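The memory-tiering idea behind SSD-streamed experts can be sketched as a small LRU cache that loads expert weights from disk on demand. This is an illustrative sketch of the general approach, not Flash-MoE's actual implementation; the file layout and `ExpertCache` class are invented for the example.

```python
import os
import tempfile
from collections import OrderedDict

import numpy as np

class ExpertCache:
    """LRU cache that streams MoE expert weights from disk on demand,
    keeping only `capacity` experts resident in memory at a time."""
    def __init__(self, expert_dir: str, capacity: int):
        self.dir = expert_dir
        self.capacity = capacity
        self.cache = OrderedDict()
        self.loads = 0  # number of disk reads, i.e. cache misses

    def get(self, expert_id: int) -> np.ndarray:
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)       # mark most recently used
            return self.cache[expert_id]
        w = np.load(os.path.join(self.dir, f"expert_{expert_id}.npy"))
        self.loads += 1
        self.cache[expert_id] = w
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)          # evict least recently used
        return w

# Demo: four tiny "experts" on disk, room for only two in memory.
d = tempfile.mkdtemp()
for i in range(4):
    np.save(os.path.join(d, f"expert_{i}.npy"), np.full((2, 2), float(i)))
cache = ExpertCache(d, capacity=2)
cache.get(0); cache.get(1); cache.get(0)  # third call is a cache hit
cache.get(2)                              # evicts expert 1
```

In a real engine the same idea is applied per layer, with prefetching driven by router scores so expert loads overlap with compute instead of stalling decode.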

RELEASE · 1mo ago
OpenAI adds container pools to Responses API for 10x faster agent spin-up

OpenAI says Responses API requests can reuse warm containers for skills, shell, and code interpreter, cutting startup times by about 10x. Faster execution matters more now that Codex is spreading to free users, students, and subagent-heavy workflows.

AI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.