AI Primer
TOPIC 50 stories

LLM Serving

Serving stacks and runtime systems for model inference.

RELEASE 27th June
DeepSeek V4-Pro benchmarks at ~90 tok/s after DSpark rollout

Independent measurements after DSpark put DeepSeek V4-Pro around 90 tok/s and cut one run from 214s to 116s. The gain matters because it lowers serving cost, though tuning details and memory overhead are still unclear.
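
As a quick check on what the reported run times imply, the sketch below converts the 214s and 116s figures into a speedup factor and a fraction of wall-clock time saved. The helper name `speedup` is illustrative, and the calculation assumes both runs covered the same workload, which the summary does not confirm.

```python
def speedup(before_s: float, after_s: float) -> tuple[float, float]:
    """Return (speedup factor, fraction of wall-clock time saved)."""
    if before_s <= 0 or after_s <= 0:
        raise ValueError("durations must be positive")
    return before_s / after_s, 1 - after_s / before_s


# Run times quoted in the summary above (seconds).
factor, saved = speedup(214.0, 116.0)
print(f"{factor:.2f}x faster, {saved:.1%} less wall-clock time")
# prints "1.84x faster, 45.8% less wall-clock time"
```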

RELEASE 26th June
DeepSeek releases DeepSpec and DSpark for speculative decoding on V4 checkpoints

DeepSeek open-sourced DeepSpec, a codebase for training and evaluating draft models for speculative decoding, alongside the DSpark decoding module for V4 checkpoints. It matters because inference teams get a new open stack for improving draft-model quality and decode throughput beyond earlier MTP-style baselines.

RELEASE 24th June
Vercel AI Gateway adds GLM-5.2 Fast at 150-250 tok/s

Vercel and Wafer launched a serverless GLM-5.2 endpoint on AI Gateway with 1M context and published pricing. Teams get a high-throughput open-model option inside an existing gateway instead of managing GLM inference directly.

NEWS 22nd June
GLM-5.2 adds Perplexity Agent API and Droid support on Baseten at >280 TPS

GLM-5.2 gained Perplexity Agent API and Droid support plus more hosting options, while Baseten reported over 280 TPS and sub-0.8s TTFT. Builders should watch the cost and benchmark data as the model moves into production agent stacks.

RELEASE 21st June
Morph supports Qwen, GLM-5.2, MiniMax M3, DeepSeek v4 with 20-35% higher code acceptance

Morph said its code-serving stack now exposes Qwen, GLM-5.2, MiniMax M3, and DeepSeek v4 with code-tuned speculative decoding. It claims 20-35% higher acceptance than Eagle 3.1 or DFlash, plus kernels for cheaper hardware.
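
To see why acceptance rate matters for speculative decoding, the sketch below uses the standard closed-form estimate of tokens emitted per verification step, which assumes each drafted token is accepted independently with the same probability. The baseline rate of 0.60 and draft length of 4 are hypothetical values chosen for illustration, and reading "20-35% higher acceptance" as a relative uplift in per-token acceptance rate is an assumption, not something the summary states.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Closed form (1 - a**(k + 1)) / (1 - a), assuming each of the k drafted
    tokens is accepted independently with probability a.
    """
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


BASELINE_RATE = 0.60  # hypothetical baseline acceptance rate
DRAFT_LEN = 4         # hypothetical number of drafted tokens per step

base = expected_tokens_per_step(BASELINE_RATE, DRAFT_LEN)
for uplift in (0.20, 0.35):  # relative uplift range quoted in the summary
    improved = expected_tokens_per_step(BASELINE_RATE * (1 + uplift), DRAFT_LEN)
    print(f"+{uplift:.0%} acceptance: {base:.2f} -> {improved:.2f} tokens/step")
```

Real gains depend on draft-model cost and batch size, so this estimate is an upper-level intuition rather than a throughput prediction.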

NEWS 1w ago
GLM-5.2 ships to BrowserCode, Hyper, OpenCode, and Together in 3 days

BrowserCode, Hyper, OpenCode, Together, and other vendors added GLM-5.2 soon after release. That turns the open model into a deployable option across coding, browser automation, and hosted chat.

NEWS 1w ago
Engineers compare GLM-5.2 local builds: $10k Mac Studio, 17 tok/s, and 2-bit quant tradeoffs

Practitioners published concrete GLM-5.2 self-host numbers, from Mac Studio and 4090-class setups to annualized power and hardware costs. That matters because open weights now offer privacy and rate-limit control, but quant quality, electricity, and latency still keep hosted APIs cheaper for many teams.

NEWS 1w ago
Ollama raises GLM-5.2 cloud capacity on NVIDIA B300s

Ollama said it doubled GPU capacity for GLM-5.2 cloud usage and that the model is currently hosted only in the US. The rollout adds capacity as open-model demand climbs, so users should check hosting and privacy details before deploying.

NEWS 1w ago
Wafer claims GLM-5.2 hits 222 tok/s and 12.6s end-to-end

Wafer said its GLM-5.2 deployment leads Artificial Analysis on throughput and latency, and priced usage at $1.20 input and $4.10 output per million tokens. Compare serverless and dedicated endpoints if you need speed at scale.
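
For budgeting against the quoted prices, a per-request cost can be sketched as below. The rates are the $1.20 input and $4.10 output per million tokens from the summary; the function name and the example token counts are hypothetical, and the calculation ignores any caching or volume discounts a provider might apply.

```python
INPUT_USD_PER_M = 1.20   # quoted input price per million tokens
OUTPUT_USD_PER_M = 4.10  # quoted output price per million tokens


def request_cost(input_tokens: int, output_tokens: int,
                 in_price: float = INPUT_USD_PER_M,
                 out_price: float = OUTPUT_USD_PER_M) -> float:
    """Return the USD cost of one request at per-million-token prices."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical agent turn: 200k tokens of context in, 20k tokens out.
print(f"${request_cost(200_000, 20_000):.3f}")  # prints "$0.322"
```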

WORKFLOW 1w ago
GLM-5.2 ships in Claude Code, Droid, and 2-bit GGUF workflows

Builders published Claude Code and Droid setups for GLM-5.2 while Unsloth quantized it for local 256GB machines and Hugging Face opened temporary free inference. Teams can now run the open-weight model across hosted, local, and agent workflows.

RELEASE 1w ago
Poolside releases Laguna M.1 open weights with 225B MoE and 256K context

Poolside released Apache 2.0 weights for Laguna M.1 and XS.2, its long-horizon coding models, with M.1 shipping at 225B total parameters, 23B active, and 256K context. SGLang and vLLM support on day one lets teams run and fine-tune the models in existing agent stacks immediately.

WORKFLOW 1w ago
Codex supports open-weight models via Ollama, vLLM, and Responses-compatible endpoints

Codex workflows can now run against open-weight models served through compatible Responses API endpoints, with Ollama and vLLM publishing direct paths for GLM-5.2 and Kimi K2.7 Code. That matters because teams can keep the Codex interface while swapping to self-hosted or lower-cost inference backends.

RELEASE 1w ago
Z.ai releases GLM-5.2 open weights with 1M context and 46.2% DeepSWE

Z.ai released GLM-5.2 MIT-licensed open weights with 1M context and broad runtime support. Vendor and arena results put it near frontier closed models on long-horizon coding.

RELEASE 1w ago
Batchwork launches a unified batch API for 7 AI providers

Batchwork launched a wrapper that normalizes batch submission, polling, and result handling across seven AI providers. It turns provider-specific async batch formats into one interface for evals, migrations, and large offline jobs.

RELEASE 1w ago
SGLang adds DFlash and Spec V2 with 4.3x Qwen3.5-397B-A17B throughput

LMSYS and Modal shipped DFlash plus Spec V2 in SGLang, claiming 4.3x the baseline throughput and 1.5x the throughput of native MTP on Qwen3.5-397B-A17B. It cuts latency and serving cost for very large open models.

NEWS 2w ago
Together AI ranks DeepSeek V4 Pro #1 on Artificial Analysis latency and speed

Together AI said its DeepSeek V4 Pro deployment now leads Artificial Analysis on both output speed and latency. The claim matters because it turns V4 serving into an inference-systems story about KV cache reuse, prefix reuse, kernels, and endpoint profiles rather than model weights alone.

RELEASE 2w ago
MiniMax opens M3 weights: 428B total, 23B active, 1M context

MiniMax published M3 weights on Hugging Face with 428B total parameters, 23B active parameters, 1M context, and multimodal support. Unsloth quickly added local GGUF builds, so teams can try 2-bit runs at 138GB RAM or VRAM and 3-bit at 165GB.
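
The quoted footprints can be turned into an effective bits-per-weight figure, which shows how far a "2-bit" or "3-bit" build sits from its nominal width. This is a rough sketch that assumes decimal gigabytes and that the quoted memory is dominated by the weights of all 428B parameters; the helper name is illustrative, and the summary does not break down where the extra bits go.

```python
def effective_bits_per_weight(footprint_gb: float, total_params_b: float) -> float:
    """Average bits per parameter implied by a memory footprint.

    footprint_gb is in decimal gigabytes (1e9 bytes) and total_params_b is
    in billions of parameters, so the 1e9 factors cancel.
    """
    if footprint_gb <= 0 or total_params_b <= 0:
        raise ValueError("inputs must be positive")
    return footprint_gb * 8 / total_params_b


TOTAL_PARAMS_B = 428  # total parameters quoted for M3, in billions

for label, footprint_gb in (("2-bit", 138), ("3-bit", 165)):
    bits = effective_bits_per_weight(footprint_gb, TOTAL_PARAMS_B)
    print(f"{label} build: ~{bits:.2f} effective bits per weight")
```

Under these assumptions the 2-bit build works out to about 2.58 bits per weight and the 3-bit build to about 3.08.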

NEWS 2w ago
North Mini Code adds MLX, Unsloth GGUFs, and oMLX support

Cohere added MLX support, Unsloth GGUFs, oMLX work, and updated docs for North Mini Code two days after launch, with llama.cpp still under review. The broader runtime coverage makes the 30B coding model easier to run on local Mac, quantized, and self-hosted stacks.

RELEASE 2w ago
Google releases DiffusionGemma 26B-A4B with 4x faster block-based text decoding

Google released Apache 2.0 DiffusionGemma, a 26B-A4B diffusion text model that claims up to 4x faster output by generating text in blocks instead of one token at a time. The release matters for local and hosted stacks that want to test a new decoding path.

RELEASE 2w ago
vLLM, Unsloth, and llama.cpp add DiffusionGemma support after launch

Google's new diffusion text model picked up same-day runtime support: vLLM added native diffusion-LM serving, Unsloth shipped GGUFs, and llama.cpp got local setup guidance. That shortens the path from release to local and hosted evaluation.

NEWS 2w ago
Apple claims 20B on-device model uses query-routed experts on iPhone 17 Pro

Apple said its most powerful on-device model runs on iPhone 17 Pro, while independent analysis describes a 20B design that routes a query to experts loaded from NAND into RAM. The architecture matters because it trades dense inference for hardware-aware expert selection, but access is constrained by device and region limits.

NEWS 3w ago
Posts cite Korean reporting: NVIDIA claims HBM4 supply for Vera Rubin from Samsung, SK hynix, Micron

Posts citing Korean reporting said NVIDIA qualified HBM4 from Samsung, SK hynix, and Micron for Vera Rubin and expanded memory co-design with SK hynix. The supply detail matters because HBM4 availability is the constraint behind next-generation AI systems.

RELEASE 3w ago
Google releases Gemma 4 QAT: E2B drops to ~1GB and Ollama, SGLang, vLLM add support

Google published Gemma 4 QAT checkpoints and mobile-focused quant formats, cutting Gemma 4 E2B to roughly 1GB of memory. Ollama, SGLang, and vLLM added day-one support, making local deployment more practical on phones, laptops, and low-VRAM GPUs.

RELEASE 3w ago
NVIDIA releases Nemotron 3 Ultra: 550B MoE, 1M context

NVIDIA shipped Nemotron 3 Ultra, a 550B/55B-active hybrid Mamba-Transformer MoE with open weights, data, and recipe, plus broad runtime and host support. It matters because the model pairs frontier open benchmarks with immediate agent-serving options, though local use still needs heavy quantization or large-memory hardware.

RELEASE 3w ago
Gemma 4 12B ships encoder-free multimodal local model with 16GB target and 256K context

Google released Gemma 4 12B, an Apache 2.0 encoder-free multimodal model with native audio and vision for 16GB-class laptops. Day-zero support in llama.cpp, vLLM, Ollama, MLX, and SGLang should make local agents and on-device apps easier to deploy immediately.

RELEASE 3w ago
Microsoft and NVIDIA launch RTX Spark PCs with 128GB unified memory and 1 PFLOP FP4

Microsoft and NVIDIA unveiled RTX Spark systems, including Surface Laptop Ultra and DGX-class Windows hardware, with 128GB unified memory and 1 PFLOP of FP4 compute for local AI. Day-one support from Hermes Agent, vLLM, Ollama, and Unsloth makes the launch useful for local inference and fine-tuning, not just a PC refresh.

RELEASE 4w ago
Step 3.7 Flash opens 30-day free access for Hermes users via Nous Portal

A day after launch, Nous made Step 3.7 Flash free for 30 days to Hermes users through Nous Portal. The access window landed alongside fresh vLLM/NIM and MLX-VLM support, making the model easier to test in both local and production stacks.

RELEASE 4w ago
vLLM releases v0.22.0 with 28.9% FP8 latency cuts and KV offloading

vLLM 0.22.0 shipped DeepSeek V4 hardening, a Rust frontend, batch-invariant Cutlass FP8 paths, and multi-tier KV cache offloading. The release also removes deprecated APIs, so some serving stacks will need upgrade work.

NEWS 4w ago
Step 3.7 Flash launches with day-one support in Kilo, Modal, SGLang, Hermes, and DesignArena

Step 3.7 Flash landed immediately across Kilo, Modal, SGLang, Hermes-linked tooling, and DesignArena as the model’s 198B MoE, 256K-context release spread through the stack. The breadth of day-one support gives engineers multiple ways to serve, benchmark, and wire the new open-weight multimodal model into agents.

RELEASE 4w ago
llama.cpp launches official site with one-line installer and unified `llama` CLI

llama.cpp now has an official website and a single-line installer that provides one `llama` entrypoint for running, serving, and agent integrations. The packaging change simplifies local setup while reusing GGUF models already on disk.

RELEASE 4w ago
Perplexity releases Unigram tokenizer with 5-6x lower CPU use

Perplexity open-sourced the XLM-RoBERTa Unigram tokenizer it rebuilt for ranking and retrieval, reporting 5-6x lower CPU use and 63 microsecond p50 at 514 tokens. Teams running fast rerankers and embedders should watch tokenization cost as a latency bottleneck.

NEWS 4w ago
OpenRouter raises $113M Series B as weekly volume hits 25T tokens

OpenRouter announced a $113M Series B led by CapitalG and said weekly routed volume grew from 5T to 25T tokens in six months. The funding matters because the company is pitching itself as production infrastructure for multi-model deployments, not just an API convenience layer.

NEWS 4w ago
MiniMax claims M3 sparse attention cuts 1M-token prefill 9.7x and decode 15.6x

MiniMax started winding down its M2 series while previewing M3 and a new sparse-attention design with large long-context speedup claims. The teaser points to a fresh open-model race around block selection, GQA, and million-token serving efficiency.

RELEASE 4w ago
MiniCPM5-1B launches with 17.9 AA and ~0.5GB INT4 weights

OpenBMB released MiniCPM5-1B and says the model leads Artificial Analysis' small-model index at 17.9 while fitting into roughly 0.5GB in INT4. The release matters because it targets phones, browsers, and local runtimes with a sub-2B open model.

NEWS 1mo ago
DeepSeek cuts V4 Pro pricing 75% to $0.435 input and $0.87 output

DeepSeek made the temporary 75% V4 Pro discount permanent, cutting first-party pricing to $0.435 per million input tokens and $0.87 output. Artificial Analysis now places it on the cost-performance frontier, but practitioners still question per-task efficiency on harder coding work.
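
The pre-discount list prices are not quoted, but they can be backed out from the 75% figure. The sketch below assumes the discount applied uniformly to both input and output rates; the implied prices are derived values, not numbers taken from DeepSeek, and the helper name is illustrative.

```python
def price_before_discount(discounted_price: float, discount: float) -> float:
    """Invert a fractional discount: discounted = original * (1 - discount)."""
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must be in [0, 1)")
    return discounted_price / (1 - discount)


DISCOUNT = 0.75  # discount quoted in the summary

# Post-discount USD prices per million tokens, as quoted.
for kind, price in (("input", 0.435), ("output", 0.87)):
    implied = price_before_discount(price, DISCOUNT)
    print(f"{kind}: ${price} now, ~${implied:.2f} implied before the cut")
```

Under that assumption the implied earlier prices are about $1.74 input and $3.48 output per million tokens.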

RELEASE 1mo ago
Cohere releases Command A+ under Apache 2.0 with 25B active params and 2x H100 deployment

Cohere open-sourced Command A+, a 218B MoE multimodal model with 25B active parameters, 48-language support, and deployment starting at two H100s. Artificial Analysis scored it 37 on its Intelligence Index and measured 281 tok/s, and vLLM plus Transformers added support.

RELEASE 1mo ago
SGLang 0.5.12 adds DeepSeek V4 serving with ShadowRadix and HiSparse

SGLang v0.5.12 added native DeepSeek V4 support with ShadowRadix prefix caching, HiSparse CPU-extended KV, MegaMoE kernels, and Blackwell MLA work. The release broadens hardware targets and improves long-context serving efficiency for open runtimes.

RELEASE 1mo ago
llama.cpp provider adds in-process AI SDK support with tool calling

A new llama.cpp provider lets the AI SDK run directly inside a Node process without a separate server, while exposing reasoning, tool calling, image inputs, and prompt caching. The setup shortens local deployment paths for AI SDK apps that want llama.cpp bindings.

RELEASE 1mo ago
Nous Research releases Lighthouse Attention: 1.4-1.7x faster pretraining at 98K context

Nous Research published Lighthouse Attention, a hierarchical selection layer that keeps the standard attention kernel while cutting end-to-end pretraining wall clock by 1.4-1.7x at 98K context. It also scales to 1M-token training across 32 Blackwell GPUs without a custom sparse kernel.

RELEASE 1mo ago
Together AI launches Gemma-4-31B-it-Pearl endpoint with 25%+ discounted pricing

Together AI launched Gemma-4-31B-it-Pearl as a serverless endpoint that uses Pearl's proof-of-useful-work emissions to offset inference cost. It matters because the pricing model ties serving economics to compute-side byproducts instead of token billing alone.

RELEASE 1mo ago
Unsloth updates Qwen3.5 MTP GGUFs with draft-mtp flags for 1.8x speed

Unsloth said its updated Qwen3.5 MTP GGUFs now run about 1.8x faster after llama.cpp added spec-draft-p-min 0.75 and renamed the mode to draft-mtp. The update also raises draft-token settings and expands the small-model MTP set for local runners.

RELEASE 1mo ago
Zyphra releases ZAYA1-8B-Diffusion-Preview on AMD with 4.6x-7.7x faster decoding

Zyphra released ZAYA1-8B-Diffusion-Preview, its first diffusion language model trained on AMD, and said 16-token block generation delivers 4.6x-7.7x faster decoding with limited quality loss. The design targets autoregressive KV-cache bottlenecks while keeping post-training and test-time compute viable.

NEWS 1mo ago
Perplexity benchmarks Qwen3 235B on GB200 NVL72: NVLS latency drops from 586 µs to 313 µs

Perplexity published serving results for post-trained Qwen3 235B on NVIDIA GB200 NVL72 and argues Blackwell materially outperforms Hopper for large MoE inference. The deltas show up in NVLS all-reduce latency, MoE prefill combine time, and high-speed decode throughput.

NEWS 1mo ago
Local users report DeepSeek V4 Flash, Qwen 3.6, and Gemma 4 at 40-200 tok/s on Macs and 3090s

Developers posted new local-model measurements for DeepSeek V4 Flash, Qwen 3.6, and Gemma 4: about 40 tok/s on an M3 Ultra, 70+ tok/s on MacBooks with MPS, and 120-200 tok/s for Qwen3.6-27B on a single RTX 3090. The numbers suggest coding-capable local runs are moving from demos toward regular use.

RELEASE 1mo ago
Gemma 4 adds MTP drafters for up to 3x faster decoding

Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.

RELEASE 1mo ago
Zyphra releases folded TSP with 173M tok/s on 1,024 MI300X GPUs

Zyphra published folded Tensor and Sequence Parallelism, claiming 173M tok/s versus 86M for matched TP+SP on 1,024 MI300X GPUs. The design keeps more replicas inside a node, reducing per-GPU memory pressure and cross-node communication.

RELEASE 1mo ago
Zyphra Inference launches MI355X endpoints for DeepSeek V3.2, Kimi K2.6, and GLM 5.1

Zyphra launched serverless inference on AMD MI355X for DeepSeek V3.2, Kimi K2.6, and GLM 5.1, aimed at long-horizon agent workloads. The service leans on high-HBM nodes to keep more long-context sessions resident and reduce queueing.

RELEASE 1mo ago
vLLM 0.20.1 fixes DeepSeek V4 TopK deadlocks and tool-call errors

The vLLM team shipped more than 10 DeepSeek V4 fixes as developers kept posting V4 Pro and Flash results from coding harnesses and local servers. Use the update if serving bugs, cache behavior, or tool-call reliability are blocking cheaper long-context agent runs.

NEWS 1mo ago
Developers report DeepSeek V4 Flash handles 32M-token coding runs for $0.25

Users reported moving long coding sessions from Claude to DeepSeek V4 Flash and seeing tens of millions of tokens cost only cents. Hacker News discussion also leaned toward Flash over Pro for day-to-day use, so teams should test whether the low published prices hold in their own workflows.

RELEASE 2mo ago
IBM releases Granite 4.1 30B/8B/3B open models under Apache 2.0

IBM released Granite 4.1 as three open instruct models, with third parties quickly reporting token-efficiency results and offering deployment access. The update matters for teams evaluating smaller open models for agent workloads where output-token burn and openness both affect production cost.


© 2026 AI Primer. All rights reserved.