AI Primer

DeepSeek V4 adds day-1 support from vLLM, SGLang, Ollama, OpenCode, Venice, and Together

Within a day of launch, vLLM, SGLang, Ollama Cloud, OpenCode, Venice, Together, and Baseten added support or hosted access for DeepSeek V4. That makes Flash and Pro easier to test across local, routed, and managed agent stacks.

5 min read

TL;DR

You can open the SGLang cookbook, browse the DeepSeek-V4-Pro weights and DeepSeek-V4-Flash weights, test Flash on Ollama Cloud, and jump straight into Venice's V4 Pro chat.

vLLM and SGLang

The first serious day-one support came from the serving stacks, not the app wrappers.

According to vllm_project's thread, vLLM implemented DeepSeek V4's long-context path with four main pieces:

  1. Shared K/V plus inverse RoPE for 2x memory savings.
  2. c4a and c128a KV compression for 4x to 128x savings.
  3. Sparse attention over compressed tokens.
  4. A short sliding window to preserve local context.

The headline number in that post was the per-layer KV state at 1M context: about 9.62 GiB, versus 83.9 GiB for a V3.2-style 61-layer stack in bf16.
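The techniques above compound multiplicatively, which is where numbers of that magnitude come from. As a back-of-envelope sketch (the layer count, head count, and head dimension below are illustrative assumptions, not DeepSeek's actual config):

```python
# Back-of-envelope KV-cache sizing. Dimensions are illustrative
# assumptions, not DeepSeek's actual architecture.

def kv_bytes(context_len, layers, kv_heads, head_dim,
             bytes_per_elem=2, k_and_v=2):
    """Plain KV-cache size in bytes (bf16 by default, K and V stored)."""
    return context_len * layers * kv_heads * head_dim * bytes_per_elem * k_and_v

GIB = 1024 ** 3

# Hypothetical dense baseline at 1M context.
baseline = kv_bytes(1_000_000, layers=61, kv_heads=8, head_dim=128)

# Stack two of the claimed savings: shared K/V plus inverse RoPE (2x),
# then c4a-style compression (4x) on top of that.
compressed = baseline / (2 * 4)

print(f"baseline:   {baseline / GIB:.2f} GiB")
print(f"compressed: {compressed / GIB:.2f} GiB  ({baseline / compressed:.0f}x smaller)")
```

With the c128a path instead of c4a, the same arithmetic would divide the baseline by 256, which is why sparse attention over compressed tokens becomes viable at million-token contexts.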

lmsysorg's architecture thread framed the SGLang side as a systems release: ShadowRadix for prefix caching on hybrid attention, HiSparse to extend sparse-attention KV into CPU memory, and a Miles RL training pipeline shipping alongside serving. The linked SGLang cookbook and support PR made the rollout look more like a reference implementation than a simple model flag.

Ollama and OpenCode

DeepSeek V4 also showed up fast in developer-facing wrappers, which is the part most people will actually touch first.

Ollama's post put deepseek-v4-flash on its US-hosted cloud and immediately wired it into multiple front ends:

  • ollama launch claude --model deepseek-v4-flash:cloud
  • ollama launch openclaw --model deepseek-v4-flash:cloud
  • ollama launch hermes --model deepseek-v4-flash:cloud
  • ollama run deepseek-v4-flash:cloud
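Beyond the CLI, a locally running `ollama` normally exposes an OpenAI-compatible HTTP endpoint, so the same cloud model slug can be hit from any stdlib HTTP client. A minimal sketch, assuming the default localhost port (the endpoint URL and request shape follow Ollama's usual OpenAI compatibility, not anything specific to this release):

```python
# Sketch of an OpenAI-style chat request to an Ollama-served model.
# The model slug comes from Ollama's post; the localhost URL assumes
# a default local `ollama` install with its OpenAI-compatible API.
import json
import urllib.request

def chat_request(model, prompt, base_url="http://localhost:11434/v1"):
    """Build an OpenAI-style chat-completion request (not yet sent)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("deepseek-v4-flash:cloud", "Summarize this diff.")
print(req.full_url)
```

Sending it is just `urllib.request.urlopen(req)` with the daemon running; the point is that the `:cloud` slug behaves like any other Ollama model name at the API layer.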

That same post said Pro was coming shortly, after an earlier Ollama update had only promised work in progress.

opencode's release note added Pro and Flash to OpenCode's Go client in v1.14.24, but also said the team was still working through capacity and usage limits. badlogicgames' follow-up did the same thing on pi.dev, where DeepSeek became a built-in provider behind /login -> Api Key -> DeepSeek.

Hosted endpoints

Managed inference providers moved almost as fast as the open serving projects.

Together's announcement listed the commercial packaging clearly: V4 Pro, a 99.9% SLA, function calling, JSON mode, and posted pricing of $2.10 input, $4.40 output, and $0.20 cached input per million tokens. It also surfaced the model-side claims that matter for deployment: 93.5 on LiveCodeBench, 3206 on Codeforces, 80.6% on SWE-Bench Verified, hybrid-attention efficiency at 27% of the FLOPs and 10% of the KV cache versus V3.2, and three reasoning modes named Non-think, Think High, and Think Max.
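The cached-input tier matters for agent workloads, where most of each request is a repeated prefix. A quick estimator using Together's posted per-million-token prices (the token counts in the example are made up):

```python
# Rough cost estimator from Together's posted V4 Pro pricing
# ($ per million tokens). The example token counts are invented.
PRICE_PER_M = {"input": 2.10, "output": 4.40, "cached_input": 0.20}

def cost_usd(input_toks, output_toks, cached_toks=0):
    """Cost of one request; cached_toks is the cache-hit portion of input."""
    uncached = input_toks - cached_toks
    return (uncached * PRICE_PER_M["input"]
            + output_toks * PRICE_PER_M["output"]
            + cached_toks * PRICE_PER_M["cached_input"]) / 1_000_000

# e.g. one agent turn: 120k input tokens (100k cached), 8k output
print(f"${cost_usd(120_000, 8_000, cached_toks=100_000):.4f}")
```

At a 10x discount on cached input, a mostly-cached 120k-token prompt costs a fraction of what the headline input price suggests.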

AskVenice's launch post made both Pro and Flash available in a consumer-facing chat product, added anonymous access, and repeated the 1M-context claim. Baseten's post filled in another deployment detail by sizing the two models as 1.6T parameters for Pro and 284B for Flash, and by calling the release a preview rather than a fully settled serving tier.

API and compatibility shifts

Part of the rollout happened before the official splash screen.

The pre-launch chatter in teortaxesTex's API spot and koltregaskes' translated screenshot suggested DeepSeek had exposed a 1M-context build on the official API before docs caught up. AiBattle_ then reported a rollback after API issues.

By launch day, teortaxesTex's later thread pointed to the cleaner compatibility story: deepseek-chat and deepseek-reasoner were being folded into the non-thinking and thinking modes of deepseek-v4-flash, with chat prefix completion and FIM completion returning in beta. That unification matters because the rollout was not only about new weights; it also changed the model slugs and the way reasoning control appears at the API layer.
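For client code, that kind of unification typically means a thin compatibility shim. A sketch of what it could look like (the field names here are illustrative assumptions, not DeepSeek's documented API):

```python
# Illustrative client-side shim for the slug unification described above:
# legacy model names map onto deepseek-v4-flash plus a thinking toggle.
# The "thinking" field is an assumed name, not a documented parameter.
LEGACY_SLUGS = {
    "deepseek-chat": {"model": "deepseek-v4-flash", "thinking": False},
    "deepseek-reasoner": {"model": "deepseek-v4-flash", "thinking": True},
}

def resolve_model(slug):
    """Translate a legacy slug into the unified model plus reasoning mode."""
    return LEGACY_SLUGS.get(slug, {"model": slug, "thinking": False})

print(resolve_model("deepseek-reasoner"))
print(resolve_model("deepseek-chat"))
```

The upside of this shape is that existing code pinned to the old slugs keeps working while new code addresses the unified model directly.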

MLX and Mac paths

The last interesting wrinkle is how quickly people started trying to squeeze V4 into local Apple-flavored routes.

simonw's question about DeepSeek-V4-Flash on Macs landed only hours after launch, which is about as good a signal as you get that the community saw Flash as a plausible local target. TheZachMueller's repost said MLX community quants had already hit Hugging Face, with LambdaAPI and Zach Mueller credited for making the release possible.

That completes the day-one map: heavyweight support in vLLM and SGLang, hosted access through Ollama, Venice, Together, and Baseten, app-level wiring in OpenCode and pi.dev, and immediate pressure from the MLX crowd to make Flash fit local hardware too.
