breakingMay 29, 2026

Step 3.7 Flash launches with day-one support in Kilo, Modal, SGLang, Hermes, and DesignArena

Step 3.7 Flash landed immediately across Kilo, Modal, SGLang, Hermes-linked tooling, and DesignArena as the model’s 198B MoE, 256K-context release spread through the stack. The breadth of day-one support gives engineers multiple ways to serve, benchmark, and wire the new open-weight multimodal model into agents.

4 min read

Step 3.7 Flash launches with day-one support in Kilo, Modal, SGLang, Hermes, and DesignArena

TL;DR

lmsysorg's launch post framed Step 3.7 Flash as a 198B sparse MoE vision-language model with 256K context, three reasoning levels, and an agent-heavy benchmark story built around coding, tool use, and visual tasks.
The rollout was unusually broad on day one: modal's announcement, vllm_project's support post, OpenRouter's listing, kilocode's Kilo post, and grx_xce's DesignArena repost all put the model on live surfaces within hours.
Serving details were part of the pitch, not an afterthought: vllm_project called out FP8 and NVFP4 weights, speculative decoding, tool calling, and reasoning parsing, while kilocode and grx_xce's DesignArena repost both highlighted roughly 400 tokens per second.
The release is being positioned as an open-weight multimodal agent model rather than a generic chat model, according to lmsysorg's benchmark summary, modal, and OpenRouter.

You can already deploy it on Modal, try the OpenRouter model page, and skim how infra partners described the same release across vLLM and SGLang. The interesting bit is how consistent the packaging was: everyone repeated the same 198B MoE, 11B-active, 256K-context shape, but each surface emphasized a different use case, from high-throughput serving to coding agents to visual UI work.

Model shape

Across the launch posts, the stable spec is a 198B sparse mixture-of-experts model with about 11B active parameters per token, plus a 256K context window. modal and vllm_project's support post both used that same parameterization, which makes the rollout read less like marketing drift and more like a coordinated infra release.

The model was also pitched as natively multimodal. lmsysorg described native multimodal perception, while OpenRouter labeled it image, video, and text capable, and modal explicitly called out image and video understanding.

One product detail kept showing up in nearly every integration post:

3 reasoning levels, per modal and OpenRouter
long-context repo and document handling via 256K context, per modal and vllm_project
an open-weight release posture, per kilocode's post and grx_xce's DesignArena repost

Benchmarks and workload fit

StepFun's launch framing leaned hard on agent efficiency, and the benchmark mix shows what that meant.

ClawEval-1.1: 67.1, ranked #1 according to lmsysorg
SimpleVQA Search: 79.2, ranked #1 according to lmsysorg
V*: 95.3, cited by lmsysorg for visual perception quality
SWE-Bench PRO: 56.3, ranked #2 according to lmsysorg

The task framing around those numbers was unusually concrete. lmsysorg tied the vision scores to turning UIs and charts into code, tied ClawEval to long-horizon tool orchestration, and tied SWE-Bench PRO to tracing repositories, isolating bugs, and shipping patches.

That same positioning carried into partner posts. kilocode called it one of the best open-weight models you can run right now, with multimodal agent behavior at 400 tok/s, while OpenRouter emphasized coding, agentic workflows, and structured outputs.

Day-one rollout surfaces

The ecosystem support landed fast enough that availability became part of the story.

Kilo shipped it on day zero, with kilocode stressing speed and open-weight access
Modal added day-zero support, with modal packaging the core specs for hosted inference
SGLang support was live immediately, according to lmsysorg and lmsysorg's repost of StepFun
vLLM support was also day zero, per vllm_project
OpenRouter listed it the same day, per OpenRouter
DesignArena opened testing quickly, according to grx_xce's repost and another DesignArena repost
Hermes-linked tooling showed early interest, with NousResearch saying they expected it to work well with Hermes Agent

For engineers, that breadth matters more than any single benchmark card. The model showed up at the model API layer, the self-hosting stack, agent tooling, and public eval surfaces at once.

Serving stack details

The most useful implementation details came from infra partners rather than from the recycled benchmark lines.

According to vllm_project, the release shipped with FP8 and NVFP4 quantized weights, built-in MTP speculative decoding, native tool calling, and reasoning parsing. That is a fairly deployment-ready bundle for a same-day open-weight launch.

Modal also published a live example endpoint through its StepFun inference example, and OpenRouter exposed a public model page immediately. Together with kilocode's weekly roundup, which grouped Step 3.7 Flash with other aggressive price-to-performance releases that week, the rollout looked designed to make the model easy to benchmark from several directions on day one.

TL;DR

Model shape

Benchmarks and workload fit

Day-one rollout surfaces

Serving stack details

Discussion across the web