Open-source serving framework for large language models and agents.

Recent stories
OpenBMB released MiniCPM-V 4.6 1.3B, claiming 55.8% lower vision-encoding FLOPs, 75.7 ms TTFT on an RTX 4090, and about 1.5x the token throughput of Qwen3.5 0.8B. It targets edge deployment across mobile platforms and common inference stacks.
Google released Multi-Token Prediction drafters for Gemma 4 and says decoding can run up to 3x faster without output-quality loss. vLLM and SGLang support shipped on day one, so local and server deployments can try the speedup immediately.
NVIDIA opened Nemotron 3 Nano Omni, a 30B-A3B model for text, image, audio, and video, with day-one serving support. That lets teams run one open model for perception-heavy agents instead of stitching separate components.
Xiaomi opened MiMo-V2.5 and MiMo-V2.5-Pro under MIT, adding a 1M-context multimodal agent model and a 42B-active Pro variant. SGLang and vLLM published day-one recipes, making the series immediately deployable.
SGLang and Miles published a technical breakdown of their DeepSeek V4 day-zero stack, including ShadowRadix caching, Flash Compressor, FP4 expert-weight handling, and measured B200/H200 throughput. That gives deployers concrete serving and training-path numbers for V4 beyond generic launch-day compatibility claims.
DeepSeek lowered V4-Pro API pricing and updated integration guidance for Claude Code, OpenCode, and OpenClaw a day after V4 launched. Check whether V4-Flash is the easier model to deploy today, given that Pro remains heavier and more rate-limited.
Within a day of launch, vLLM, SGLang, Ollama Cloud, OpenCode, Venice, Together, and Baseten added support or hosted access for DeepSeek V4. That makes Flash and Pro easier to test across local, routed, and managed agent stacks.
DeepSeek open-sourced V4-Pro and V4-Flash under MIT, with 1M context and aggressive Flash pricing. Day-one support in SGLang, vLLM, and OpenRouter pushes open-weight agentic coding closer to closed frontier models.
Tencent open-sourced Hy3 preview, a 295B MoE with 21B active parameters and 256K context, then pushed it into OpenRouter, OpenCode, OpenClaw, vLLM, and SGLang immediately. That matters because engineers can test and deploy a new reasoning-agent model on day one instead of waiting for the runtime ecosystem to catch up.
Alibaba released Qwen3.6-27B, a dense open model with multimodal input and both thinking and non-thinking modes; it beats Qwen3.5-397B-A17B across major coding benchmarks. Day-one support across vLLM, SGLang, Ollama, llama.cpp, GGUF, and MLX makes it ready for local and hosted coding agents.
Kimi K2.6 shipped across vLLM, SGLang, OpenRouter, Baseten, Ollama, OpenCode, Hermes Agent, and Droid within hours of launch. That cuts the usual lag between model release and production trials, so mixed-provider agent stacks can test it sooner.
MiniMax open-sourced M2.7 and published coding and agent benchmark claims, including 56.22% on SWE-Pro and 57.0% on Terminal Bench 2. Day-zero support from SGLang, vLLM, Ollama Cloud, Together AI, and NVIDIA NIM makes it easy to try on common serving stacks.
Z.ai released GLM-5.1, a 744B open model built for long-horizon agentic coding and ranked first among open systems on SWE-Bench Pro. Day-zero support in OpenRouter, Ollama, SGLang, vLLM, OpenCode, and local quantization paths makes it ready to test in existing stacks.