LLM Serving
Serving stacks and runtime systems for model inference.
Stories
Perplexity published serving results for post-trained Qwen3 235B on NVIDIA GB200 NVL72 and argues Blackwell materially outperforms Hopper for large MoE inference. The deltas show up in NVLS all-reduce latency, MoE prefill combine time, and high-speed decode throughput.
Developers posted new local-model measurements for DS4, Qwen3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.
Community ports brought Gemma 4 multi-token prediction into llama.cpp and MLX Swift, with one M5 Max report moving from 97 to 138 tok/s and another showing 30-40% faster decoding. The gains extend MTP into local runtimes used for on-device coding and long-context work.
Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.
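MTP drafters slot into the standard draft-and-verify loop of speculative decoding, which is why the speedup can come without output-quality loss: the target model still decides every emitted token. The sketch below is a minimal greedy version of that loop under assumed `draft_next` and `target_argmax` callables; it is an illustration of the pattern, not Gemma 4's or vLLM's actual implementation.

```python
# Minimal greedy draft-and-verify loop (illustrative; not a specific runtime's code).
# draft_next(seq)      -> cheap drafter's next-token guess given seq
# target_argmax(seq)   -> target model's greedy next token after each prefix seq[:j+1]
from typing import Callable, List

def speculative_decode_step(
    tokens: List[int],
    draft_next: Callable[[List[int]], int],
    target_argmax: Callable[[List[int]], List[int]],
    k: int = 4,
) -> List[int]:
    # 1) Draft k tokens autoregressively with the cheap drafter.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2) One target-model pass scores all drafted positions at once.
    verified = target_argmax(tokens + draft)

    # 3) Accept drafted tokens while they match the target's greedy choice;
    #    at the first mismatch, emit the target's own token instead.
    accepted = []
    for i, d in enumerate(draft):
        target_tok = verified[len(tokens) + i - 1]
        if d == target_tok:
            accepted.append(d)
        else:
            accepted.append(target_tok)
            break
    else:
        accepted.append(verified[-1])  # all drafts accepted: one bonus token
    return tokens + accepted
```

Every step emits at least one token chosen by the target model, so the best case is k+1 tokens per target pass and the worst case matches plain decoding.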
Zyphra published folded Tensor and Sequence Parallelism, claiming 173M tok/s versus 86M for matched TP+SP on 1,024 MI300X GPUs. The design keeps more replicas inside a node, reducing per-GPU memory pressure and cross-node communication.
Zyphra launched serverless inference on AMD MI355X for DeepSeek V3.2, Kimi K2.6, and GLM 5.1, aimed at long-horizon agent workloads. The service leans on high-HBM nodes to keep more long-context sessions resident and reduce queueing.
The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.
Users reported moving long coding sessions from Claude to DeepSeek V4 Flash and seeing tens of millions of tokens cost only cents. Hacker News discussion also leaned toward Flash over Pro for day-to-day use, so teams should test whether the low published prices hold in their own workflows.
IBM released Granite 4.1 as three open instruct models, with third parties quickly surfacing token-efficiency numbers and deployment options. The update matters for teams evaluating smaller open models for agent workloads where output-token burn and openness both affect production cost.
NVIDIA opened Nemotron 3 Nano Omni, a 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching separate components.
Poolside opened Laguna M.1 and Laguna XS.2 as its first public coding models, with Apache 2.0 weights and same-day provider support. That gives teams open coding models that can run locally or through standard serving stacks.
AWS says OpenAI models will land on Bedrock in the coming weeks alongside a new stateful runtime. OpenAI also said its Microsoft partnership is now non-exclusive, which opens a multi-cloud path for deployment and procurement.
OpenClaw 2026.4.26 shipped Google Live Talk, local-model fixes, openclaw migrate imports for Claude and Hermes, and one-command Matrix E2EE. It also hardens plugins, Docker, and transcript compaction for self-hosted agent runs.
Xiaomi opened MiMo-V2.5 and MiMo-V2.5-Pro under MIT, adding a 1M-context multimodal agent model and a 42B-active Pro variant. SGLang and vLLM published day-one recipes, making the series immediately deployable.
vLLM 0.20.0 shipped a new CUDA 13 / PyTorch 2.11 / Transformers v5 baseline, TurboQuant 2-bit KV cache, FA4 MLA defaults, and deeper DeepSeek V4 support. The release changes serving baselines across NVIDIA, AMD, Intel, and ARM-CUDA setups, including 4x KV capacity and a clearer upgrade path for teams already running V4.
DeepSeek said cache-hit pricing across its API series is now one-tenth of launch levels, on top of the temporary V4-Pro discount through May 5. The cut lowers costs for cache-heavy long-context and agent workloads, so teams should recheck spend assumptions.
Builders published new MLX and 3-bit Qwen3.6 quants and shared reproducible local benchmarks from M3 Ultra, RTX 5070, and Radeon AI Pro setups. That gives local-agent teams concrete deployment options beyond launch-day claims, though memory budgets and long-context tool use still limit larger workflows.
DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier deploy today, while Pro stays heavier and more rate-limited.
SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.
Engineers unpacked DeepSeek V4's hybrid CSA/HCA attention a day after launch; it claims 27% of V3.2 FLOPs and 10% of its KV cache at 1M tokens. External tests pushed V4 Pro near the top of open-model indexes, but users also reported rate limits and mixed third-party results.
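To see why a 10% KV-cache figure at 1M tokens matters, a back-of-envelope calculation helps. The layer and head counts below are illustrative assumptions for generic dense attention, not DeepSeek's actual configuration; the point is only the order of magnitude.

```python
# Back-of-envelope KV-cache size for dense attention at long context.
# Assumed (hypothetical) config: 60 layers, 8 KV heads, head_dim 128, BF16 values.
def kv_cache_bytes(seq_len, n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # 2x accounts for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gb = kv_cache_bytes(1_000_000) / 1e9
print(f"~{gb:.0f} GB per 1M-token sequence")   # ~246 GB under these assumptions
print(f"~{gb * 0.10:.0f} GB at 10% of that")   # ~25 GB
```

At those sizes, a 10x KV reduction is the difference between a handful of long-context sessions per node and many.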
Within a day of launch, vLLM, SGLang, Ollama cloud, OpenCode, Venice, Together, and Baseten added support or hosted access for DeepSeek V4. That makes Flash and Pro easier to test across local, routed, and managed agent stacks.
DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.
Tencent open-sourced Hy3 preview, a 295B MoE with 21B active parameters and 256K context, then pushed it into OpenRouter, OpenCode, OpenClaw, vLLM, and SGLang immediately. That matters because engineers can test and deploy a new reasoning-agent model on day one instead of waiting for the runtime ecosystem to catch up.
DeepSeek published Tile Kernels, an open-source TileLang repo covering Engram, mHC, MoE routing, and FP4/FP8 kernels, with claims that some are already used in internal training and inference. That matters because it exposes reusable low-level performance work behind DeepSeek’s stack instead of keeping the kernels fully private.
Alibaba released Qwen3.6-27B, a dense open model with multimodal input and thinking or non-thinking modes that beats Qwen3.5-397B-A17B across major coding benchmarks. Day-one support across vLLM, SGLang, Ollama, llama.cpp, GGUF, and MLX makes it ready for local and hosted coding agents.
A day after Kimi K2.6’s launch, providers and tools opened new access paths including temporary free use in Hermes and Cline plus availability on Replicate, Together, Perplexity, and Tinker. Engineers can test the open model across agent harnesses and hosted runtimes without standing up their own stack first.
Xiaomi’s MiMo-V2.5-Pro and MiMo-V2.5 arrived with million-token context windows, stronger coding and agentic claims, and immediate access through OpenRouter plus agent harnesses. The rollout adds another low-cost Chinese frontier model that engineers can route into coding workflows without waiting for a proprietary IDE deal.
Moonshot put Kimi K2.6 on API with cache-hit/cache-miss pricing, tool calls, JSON modes, and native text-image-video input. It also open-sourced FlashKDA and landed in Warp, Cosine, Genspark, and OpenClaw, making the launch usable as coding-agent infrastructure from day one.
Kimi K2.6 shipped across vLLM, SGLang, OpenRouter, Baseten, Ollama, OpenCode, Hermes Agent, and Droid within hours of launch. That cuts the usual lag between model release and production trials, so mixed-provider agent stacks can test it sooner.
A weekend of Gemma 4 demos spanned YC hackathon projects, offline iPhone runs, and HN reports of strong local coding and SQL-agent performance. Gemma 4 is increasingly showing up as a practical edge model for tool use and multimodal apps, not just a release benchmark.
Fresh local reports put Qwen3.6-35B-A3B around 40 tok/s on M3 Ultra, extended testing to Strix Halo, and wired it into OpenClaw and Pi-style harnesses. The update matters because Qwen3.6 is moving from quant benchmarks into real local coding-agent loops with clearer hardware limits.
Moonshot says its Prefill-as-a-Service setup makes prefill/decode disaggregation practical across datacenters and mixed hardware by shrinking KV cache with Kimi Linear. The paper reports 1.54x throughput and a 64% drop in P90 time-to-first-token, so benchmark the approach before planning production adoption.
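The disaggregation pattern the paper describes separates the compute-bound prefill pass from the bandwidth-bound decode loop so each can run on different hardware, with the KV cache shipped between them. The sketch below shows that split in its simplest form; all class and method names are hypothetical placeholders, not Moonshot's API, and the hard part in practice is compressing the shipped cache (here just an opaque blob).

```python
# Minimal prefill/decode disaggregation sketch (hypothetical interfaces throughout).
from dataclasses import dataclass
from typing import List

@dataclass
class KVCache:
    blob: bytes          # serialized (ideally compressed) key/value tensors
    n_tokens: int

class PrefillWorker:
    def __init__(self, model):
        self.model = model

    def prefill(self, prompt_tokens: List[int]) -> KVCache:
        # One forward pass over the whole prompt; this is the FLOPs-heavy stage.
        kv = self.model.forward_prefill(prompt_tokens)
        return KVCache(blob=kv.serialize(), n_tokens=len(prompt_tokens))

class DecodeWorker:
    def __init__(self, model):
        self.model = model

    def decode(self, kv: KVCache, max_new_tokens: int) -> List[int]:
        # Restore the shipped cache and generate token-by-token; this stage is
        # memory-bandwidth bound, so it can live on cheaper or different hardware.
        state = self.model.load_kv(kv.blob)
        out = []
        for _ in range(max_new_tokens):
            tok = self.model.forward_decode(state)
            out.append(tok)
            if tok == self.model.eos_token_id:
                break
        return out
```

Shrinking the KV cache before shipping it is what makes this workable across datacenters, which is where the Kimi Linear claim fits in.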
Alibaba open-sourced Qwen3.6-35B-A3B, a 35B multimodal sparse MoE with only 3B active parameters under Apache 2.0. Same-day support from vLLM, Ollama, SGLang, and GGUF builders makes it immediately usable for local and production coding workloads.
MiniMax M2.7 moved from announcement to deployment, with GGUF guidance for 128 GB local systems and same-day availability on Together, Fireworks, Hugging Face, and ModelScope. Use the local and managed serving options now, but check the non-commercial license before adopting the 230B model.
MiniMax open-sourced M2.7 and published coding and agent benchmark claims including 56.22% SWE-Pro and 57.0% Terminal Bench 2. Day-zero support from SGLang, vLLM, Ollama Cloud, Together AI, and NVIDIA NIM makes it easy to try on common serving stacks.
Providers and agent platforms added GLM-5.1 endpoints across Modal, Together AI, Letta Code, Tembo, and Tabbit, with free trials, no-key access, and 99.9% SLA options. Use the new hosting options to test the model for coding and long-horizon agent workloads without waiting on self-hosting.
HN practitioners report Gemma 4 26B-A4B near 40 tokens per second in code-agent harnesses on Mac-class hardware, and Unsloth published a free Colab fine-tuning flow. Use the local benchmark as a practical reference and the Colab path if you want task-specific tuning without added cost.
Users published reproducible 16 GB VRAM and Apple Silicon setups for the Gemma 4 26B-A4B and 31B variants. Google’s AI Gallery app also brought offline Gemma chat to phones. The setups make local coding and vision work more practical, but runtime choice, quantization, and recent llama.cpp regressions still affect reliability.
Ollama's Apple Silicon preview switches local inference to MLX, and users reportedly see sizable speedups with some Qwen3.5 variants on M-series Macs. Try it if you run local coding agents, since faster prefill and caching can cut session reload time.
New discussion around TurboQuant focuses on its 2.5-bit mixed-precision setup and working PyTorch and llama.cpp implementations. The technique is moving from a research claim into deployable KV-cache compression with concrete porting details.
Arm introduced its first production server chip under its own banner, with up to 136 Neoverse V3 cores and a 272-core dual-node reference blade. The launch pushes Arm deeper into direct datacenter silicon for agentic AI workloads, not just IP licensing.
tiny corp is shipping tinybox red v2 at $12,000 with four 9070 XT GPUs and 64 GB of GPU memory, alongside higher-end Blackwell systems. Buyers are weighing the bundled tinygrad stack against DIY rigs, model-fit limits, and cloud economics.
Compromised LiteLLM 1.82.7 and 1.82.8 wheels executed a malicious .pth file at install time to exfiltrate credentials, and PyPI quarantined the releases. Treat fresh-package installs and AI infra dependencies as supply-chain risk, and check startup hooks on affected systems.
Google Research said TurboQuant can shrink KV-cache values to about 3 bits, roughly 6x less memory, and early implementations already surfaced in PyTorch, llama.cpp, and Atomic Chat. The work targets a core inference bottleneck for long-context serving on local and server hardware.
Cohere released a 2B speech-to-text model with 14 languages and top Open ASR scores, and upstreamed encoder-decoder optimizations to vLLM in the same launch. It is a self-hosted ASR option, so test accuracy and throughput on your own speech workload.
TurboQuant claims 6x KV-cache memory reduction and up to 8x faster attention on H100s without retraining or quality loss on long-context tasks. If those results hold in serving stacks, teams should revisit long-context cost, capacity, and vector-search design.
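The arithmetic behind figures like these is low-bit quantization with per-group scales. The sketch below is not TurboQuant's actual algorithm, just a plain round-to-nearest baseline that shows where the memory numbers come from: 3-bit codes plus one FP16 scale and zero-point per 128 values land around 3.25 effective bits per value, versus 16 for BF16 KV, in the ballpark of the 5-6x reductions these stories describe.

```python
# Generic per-group KV quantization baseline (not TurboQuant's algorithm).
import numpy as np

def quantize_group(x: np.ndarray, bits: int = 3):
    # x: one group of KV values (e.g. 128 floats from a key or value vector)
    qmax = (1 << bits) - 1
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    # Codes are stored in uint8 here for simplicity; real kernels bit-pack them.
    codes = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return codes, np.float16(scale), np.float16(lo)

def dequantize_group(codes, scale, lo):
    return codes.astype(np.float32) * np.float32(scale) + np.float32(lo)

x = np.random.randn(128).astype(np.float32)
codes, scale, lo = quantize_group(x, bits=3)
err = np.abs(dequantize_group(codes, scale, lo) - x).mean()
bits_per_value = 3 + (16 + 16) / 128   # codes + per-group scale and zero-point
print(f"mean abs error {err:.3f}, ~{bits_per_value:.2f} effective bits/value")
```

The published claims go further (mixed precision, no quality loss, attention speedups), which is exactly what teams should verify in their own serving stacks before redesigning capacity plans.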
Flash-MoE now shows SSD-streamed expert weights pushing a 397B Qwen3.5 variant onto an iPhone at 0.6 tokens per second, extending its earlier laptop demos. Treat it as a memory-tiering prototype rather than a deployable mobile serving target, because speed, heat, and context headroom remain tight.
Miles added ROCm support for AMD Instinct clusters and reported GRPO post-training gains on Qwen3-30B-A3B, including AIME rising from 0.665 to 0.729. It matters if you are evaluating rollout-heavy RL jobs off NVIDIA and want concrete throughput and step-time numbers before porting.
A pure C and Metal engine streams 209 GB of MoE weights from SSD and reports tool-calling support in 4-bit mode on a laptop-class Mac. It is a concrete benchmark for teams exploring expert streaming, quantization, and page-cache tricks on consumer hardware.
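The core trick in these SSD-streaming engines is treating expert weights as a tiered store: map the full weight file, touch only the experts the router picks, and keep a small resident cache of the hottest ones. Below is a minimal Python illustration of that idea; the real engines are C and Metal, and the file layout, shapes, and cache size here are assumptions.

```python
# Illustrative MoE expert streaming: memory-mapped weights plus a small LRU cache.
from collections import OrderedDict
import numpy as np

class ExpertStore:
    def __init__(self, path: str, n_experts: int, expert_shape, cache_size: int = 8):
        self.expert_shape = expert_shape
        self.expert_elems = int(np.prod(expert_shape))
        # np.memmap leaves paging to the OS, so nothing is loaded until an expert is used.
        self.weights = np.memmap(path, dtype=np.float16, mode="r",
                                 shape=(n_experts, self.expert_elems))
        self.cache = OrderedDict()   # expert_id -> resident np.ndarray
        self.cache_size = cache_size

    def get(self, expert_id: int) -> np.ndarray:
        # Hot path: reuse a resident expert and mark it most-recently-used.
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)
            return self.cache[expert_id]
        # Miss: materialize the expert from the mapped file, evict the LRU entry.
        w = np.array(self.weights[expert_id]).reshape(self.expert_shape)
        self.cache[expert_id] = w
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)
        return w
```

Throughput then depends almost entirely on router locality and SSD read bandwidth, which is why these demos report tokens per second in the single digits or below.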
OpenAI says Responses API requests can reuse warm containers for skills, shell, and code interpreter, cutting startup times by about 10x. Faster execution matters more now that Codex is spreading to free users, students, and subagent-heavy workflows.
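The underlying pattern is a warm pool: keep recently used sandboxes alive for a short TTL and hand them to follow-up requests instead of cold-starting every time. A generic sketch of that pattern follows; it is not OpenAI's implementation, and the spawn callable and TTL are assumptions.

```python
# Generic warm-container pool (illustrative pattern, not a specific vendor's code).
import time

class WarmContainerPool:
    def __init__(self, spawn, ttl_seconds: float = 300.0):
        self.spawn = spawn          # callable that cold-starts a sandbox (expensive)
        self.ttl = ttl_seconds
        self.idle = []              # list of (container, last_used_timestamp)

    def acquire(self):
        now = time.monotonic()
        # Drop containers that have sat idle past their TTL.
        self.idle = [(c, t) for c, t in self.idle if now - t < self.ttl]
        if self.idle:
            container, _ = self.idle.pop()   # warm start: reuse a live sandbox
            return container
        return self.spawn()                  # cold start only when the pool is empty

    def release(self, container):
        self.idle.append((container, time.monotonic()))
```

The roughly 10x startup claim is consistent with skipping container boot and environment setup on reuse, which matters most for subagent-heavy workflows that open many short-lived tool sessions.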