Skip to content
AI Primer

Open-source serving framework for large language and vision-language models with structured generation and OpenAI-compatible APIs.

Screenshot of SGLang website

Recent stories

19 linked stories
releasePRIMARY2026-06-15
SGLang adds DFlash and Spec V2 with 4.3x Qwen3.5-397B-A17B throughput

LMSYS and Modal shipped DFlash plus Spec V2 in SGLang, claiming 4.3x baseline throughput and 1.5x native MTP on Qwen3.5-397B-A17B. It cuts latency and serving cost for very large open models.

releaseSECONDARY2026-06-05
Google releases Gemma 4 QAT: E2B drops to ~1GB and Ollama, SGLang, vLLM add support

Google published Gemma 4 QAT checkpoints and mobile-focused quant formats, cutting Gemma 4 E2B to roughly 1GB of memory. Ollama, SGLang, and vLLM added day-one support, making local deployment more practical on phones, laptops, and low-VRAM GPUs.

releaseSECONDARY2026-06-04
NVIDIA releases Nemotron 3 Ultra: 550B MoE, 1M context

NVIDIA shipped Nemotron 3 Ultra, a 550B/55B-active hybrid Mamba-Transformer MoE with open weights, data, and recipe, plus broad runtime and host support. It matters because the model pairs frontier open benchmarks with immediate agent-serving options, though local use still needs heavy quantization or large-memory hardware.

releaseSECONDARY2026-06-01
NVIDIA launches Cosmos 3 open 16B and 64B omnimodels with datasets and SGLang support

NVIDIA released Cosmos 3 as an open omnimodel family with 16B and 64B variants, plus code, datasets, and a coalition around physical AI. The release matters because it ships with serving support and top open-weight image and video rankings, so teams can use it beyond a research teaser.

newsSECONDARY2026-05-29
Step 3.7 Flash launches with day-one support in Kilo, Modal, SGLang, Hermes, and DesignArena

Step 3.7 Flash landed immediately across Kilo, Modal, SGLang, Hermes-linked tooling, and DesignArena as the model’s 198B MoE, 256K-context release spread through the stack. The breadth of day-one support gives engineers multiple ways to serve, benchmark, and wire the new open-weight multimodal model into agents.

releasePRIMARY2026-05-16
SGLang 0.5.12 adds DeepSeek V4 serving with ShadowRadix and HiSparse

SGLang v0.5.12 added native DeepSeek V4 support with ShadowRadix prefix caching, HiSparse CPU-extended KV, MegaMoE kernels, and Blackwell MLA work. The release broadens hardware targets and improves long-context serving efficiency for open runtimes.

releaseSECONDARY2026-05-11
OpenBMB releases MiniCPM-V 4.6 1.3B with 75.7 ms TTFT and 19x token efficiency

OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8 percent lower vision-encoding FLOPs, 75.7 ms TTFT on a 4090, and about 1.5x token throughput over Qwen3.5 0.8B. It targets edge deployment across mobile platforms and common inference stacks.

releaseSECONDARY2026-05-05
Gemma 4 adds MTP drafters for up to 3x faster decoding

Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.

releaseSECONDARY2026-04-28
Nemotron 3 Nano Omni launches 30B-A3B multimodal model with 256K context

NVIDIA opened Nemotron 3 Nano Omni, a 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching separate components.

releaseSECONDARY2026-04-27
MiMo-V2.5 opens under MIT with 1M context and SGLang vLLM support

Xiaomi opened MiMo-V2.5 and MiMo-V2.5-Pro under MIT, adding a 1M-context multimodal agent model and a 42B-active Pro variant. SGLang and vLLM published day-one recipes, making the series immediately deployable.

newsPRIMARY2026-04-25
SGLang supports DeepSeek V4 with 199 tok/s on B200 and 240 tok/s at 900K context

SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.

newsSECONDARY2026-04-25
DeepSeek cuts V4-Pro API 75% to $0.43/$0.87 per 1M tokens through May 5

DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier deploy today, while Pro stays heavier and more rate-limited.

newsSECONDARY2026-04-24
DeepSeek V4 adds day-1 support from vLLM, SGLang, Ollama, OpenCode, Venice, and Together

Within a day of launch, vLLM, SGLang, Ollama cloud, OpenCode, Venice, Together, and Baseten added support or hosted access for DeepSeek V4. That makes Flash and Pro easier to test across local, routed, and managed agent stacks.

releaseSECONDARY2026-04-23
DeepSeek releases V4-Pro and V4-Flash with 1M context and $0.14/M input

DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.

releaseSECONDARY2026-04-23
Tencent launches Hy3 preview with 295B/21B, 256K context, and day-one OpenRouter, vLLM, and SGLang support

Tencent open-sourced Hy3 preview, a 295B MoE with 21B active parameters and 256K context, then pushed it into OpenRouter, OpenCode, OpenClaw, vLLM, and SGLang immediately. That matters because engineers can test and deploy a new reasoning-agent model on day one instead of waiting for the runtime ecosystem to catch up.

releaseSECONDARY2026-04-22
Qwen3.6-27B releases with 77.2 SWE-Bench Verified and Apache 2.0

Alibaba released Qwen3.6-27B, a dense open model with multimodal input and thinking or non-thinking modes that beats Qwen3.5-397B-A17B across major coding benchmarks. Day-one support across vLLM, SGLang, Ollama, llama.cpp, GGUF, and MLX makes it ready for local and hosted coding agents.

newsSECONDARY2026-04-20
Kimi K2.6 adds day-one support across vLLM, SGLang, Ollama, and OpenRouter

Kimi K2.6 shipped across vLLM, SGLang, OpenRouter, Baseten, Ollama, OpenCode, Hermes Agent, and Droid within hours of launch. That cuts the usual lag between model release and production trials, so mixed-provider agent stacks can test it sooner.

releaseSECONDARY2026-04-11
MiniMax releases M2.7 open model with 56.22% SWE-Pro and 57.0% Terminal Bench 2

MiniMax open-sourced M2.7 and published coding and agent benchmark claims including 56.22% SWE-Pro and 57.0% Terminal Bench 2. Day-zero support from SGLang, vLLM, Ollama Cloud, Together AI, and NVIDIA NIM makes it easy to try on common serving stacks.

releaseSECONDARY2026-04-07
Z.ai releases GLM-5.1, a 744B open model with 58.4 SWE-Bench Pro and 8-hour agent runs

Z.ai released GLM-5.1, a 744B open model built for long-horizon agentic coding and ranked first among open systems on SWE-Bench Pro. Day-0 support in OpenRouter, Ollama, SGLang, vLLM, OpenCode, and local quantization paths makes it ready to test in existing stacks.

AI PrimerAI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.