AI Primer

Open-source serving framework for large language models and agents.

Screenshot of SGLang website

Recent stories

13 linked stories
Release · Secondary · 2026-05-11
OpenBMB releases MiniCPM-V 4.6 1.3B with 75.7 ms TTFT and 19x token efficiency

OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8% lower vision-encoding FLOPs, 75.7 ms TTFT on a 4090, and about 1.5x token throughput over Qwen3.5 0.8B. It targets edge deployment across mobile platforms and common inference stacks.

Release · Secondary · 2026-05-05
Gemma 4 adds MTP drafters for up to 3x faster decoding

Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped day one, so local and server deployments can try the speedup immediately.

Release · Secondary · 2026-04-28
Nemotron 3 Nano Omni launches 30B-A3B multimodal model with 256K context

NVIDIA opened Nemotron 3 Nano Omni, a 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching separate components.

Release · Secondary · 2026-04-27
MiMo-V2.5 opens under MIT with 1M context and day-one SGLang and vLLM support

Xiaomi opened MiMo-V2.5 and MiMo-V2.5-Pro under MIT, adding a 1M-context multimodal agent model and a 42B-active Pro variant. SGLang and vLLM published day-one recipes, making the series immediately deployable.

News · Primary · 2026-04-25
SGLang supports DeepSeek V4 with 199 tok/s on B200 and 240 tok/s at 900K context

SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.

News · Secondary · 2026-04-25
DeepSeek cuts V4-Pro API 75% to $0.43/$0.87 per 1M tokens through May 5

DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier deployment today, since Pro remains heavier and more rate-limited.

News · Secondary · 2026-04-24
DeepSeek V4 adds day-1 support from vLLM, SGLang, Ollama, OpenCode, Venice, and Together

Within a day of launch, vLLM, SGLang, Ollama Cloud, OpenCode, Venice, Together, and Baseten added support or hosted access for DeepSeek V4. That makes Flash and Pro easier to test across local, routed, and managed agent stacks.

Release · Secondary · 2026-04-23
DeepSeek releases V4-Pro and V4-Flash with 1M context and $0.14/M input

DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.

Release · Secondary · 2026-04-23
Tencent launches Hy3 preview with 295B/21B, 256K context, and day-one OpenRouter, vLLM, and SGLang support

Tencent open-sourced Hy3 preview, a 295B MoE with 21B active parameters and 256K context, then pushed it into OpenRouter, OpenCode, OpenClaw, vLLM, and SGLang immediately. That matters because engineers can test and deploy a new reasoning-agent model on day one instead of waiting for the runtime ecosystem to catch up.

Release · Secondary · 2026-04-22
Qwen3.6-27B releases with 77.2 SWE-Bench Verified and Apache 2.0

Alibaba released Qwen3.6-27B, a dense open model with multimodal input and thinking and non-thinking modes, which beats Qwen3.5-397B-A17B across major coding benchmarks. Day-one support across vLLM, SGLang, Ollama, llama.cpp, GGUF, and MLX makes it ready for local and hosted coding agents.

News · Secondary · 2026-04-20
Kimi K2.6 adds day-one support across vLLM, SGLang, Ollama, and OpenRouter

Kimi K2.6 shipped across vLLM, SGLang, OpenRouter, Baseten, Ollama, OpenCode, Hermes Agent, and Droid within hours of launch. That cuts the usual lag between model release and production trials, so mixed-provider agent stacks can test it sooner.

Release · Secondary · 2026-04-11
MiniMax releases M2.7 open model with 56.22% SWE-Pro and 57.0% Terminal Bench 2

MiniMax open-sourced M2.7 and published coding and agent benchmark claims including 56.22% SWE-Pro and 57.0% Terminal Bench 2. Day-zero support from SGLang, vLLM, Ollama Cloud, Together AI, and NVIDIA NIM makes it easy to try on common serving stacks.

Release · Secondary · 2026-04-07
Z.ai releases GLM-5.1, a 744B open model with 58.4 SWE-Bench Pro and 8-hour agent runs

Z.ai released GLM-5.1, a 744B open model built for long-horizon agentic coding and ranked first among open systems on SWE-Bench Pro. Day-0 support in OpenRouter, Ollama, SGLang, vLLM, OpenCode, and local quantization paths makes it ready to test in existing stacks.
