
Mistral releases Small 4 119B MoE with 256K context

Mistral shipped Mistral Small 4, a 119B MoE model with 6.5B active parameters, multimodal input, configurable reasoning, and Apache 2.0 weights. Deploy it quickly in existing stacks if you use SGLang or vLLM, which added day-one support.


TL;DR

  • Mistral has released Mistral Small 4, a 119B mixture-of-experts model with 6.5B active parameters per token, a 256K context window, text-and-image input, and a single checkpoint that merges instruct, reasoning, and coding/agentic behavior, per the launch summary and surrounding release chatter.
  • The model is open under Apache 2.0 and is already exposed in Mistral's own stack: a Playground screenshot shows mistral-small-latest mapped to “Mistral Small 4,” while the launch post link points to the official model release.
  • For deployment teams, support landed immediately in both serving ecosystems: SGLang's day-0 post added launch commands plus Mistral-specific tool and reasoning parsers, and the vLLM announcement confirmed verified support on NVIDIA GPUs.
  • Mistral is pitching Small 4 as a step up from its earlier lineup rather than a minor refresh: the benchmark chart shows separate instruct and reasoning scores across GPQA, MMLU-Pro, IFBench, Arena Hard, and MMMU-Pro, while the NVIDIA tie-up post links the launch to a broader open-model partnership.

What shipped

Mistral Small 4 arrived after pre-release signals in a Hugging Face integration PR, which surfaced the core packaging ahead of launch. That material described a “powerful hybrid model” that “unifies” Instruct, Reasoning, and Devstral-style capabilities in one model, with 128 experts, 4 active experts, 119B total parameters, and 6.5B active per token, as shown in the pre-release PR leak and the architecture screenshot.
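Those MoE figures are internally consistent at back-of-envelope precision, as the sketch below shows; the shared-versus-expert split at the end is an inference, not a breakdown Mistral has published.

```python
# Back-of-envelope MoE arithmetic using only the figures from the PR leak.
# The shared-parameter interpretation at the end is an assumption, not a
# published breakdown.
total_params = 119e9   # total parameters
active_params = 6.5e9  # active parameters per token
experts_total = 128
experts_active = 4

print(f"active fraction: {active_params / total_params:.1%}")    # ~5.5%
print(f"routed fraction: {experts_active / experts_total:.1%}")  # ~3.1%
# The gap between ~5.5% active and ~3.1% routed suggests an always-on shared
# component (attention, embeddings, any shared expert) on top of the routed experts.
```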

The shipped model keeps a broad feature surface for one checkpoint: multimodal input with text output, configurable reasoning effort per request, native function calling, JSON output, multilingual support, and a 256K context window. Mistral is also releasing it as open weights under Apache 2.0, and the Hugging Face collection makes clear this is a family of checkpoints rather than a single artifact.
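For teams that just want to poke at the hosted alias, a minimal request sketch follows. It assumes Mistral's current Python SDK (mistralai) and the mistral-small-latest alias from the Playground screenshot; the JSON-output flag mirrors the feature list above, and nothing here is an official release example.

```python
# A minimal sketch, assuming Mistral's Python SDK ("mistralai") and the
# mistral-small-latest alias the Playground screenshot shows mapped to
# "Mistral Small 4"; not an official example from the release.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

resp = client.chat.complete(
    model="mistral-small-latest",  # alias per the Playground screenshot
    messages=[{
        "role": "user",
        "content": "Return a JSON object with keys 'model' and 'context_window'.",
    }],
    response_format={"type": "json_object"},  # the JSON-output mode the post lists
)
print(resp.choices[0].message.content)
```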

How fast can engineers deploy it

SGLang shipped day-one support with a concrete server command that loads mistralai/Mistral-Small-4-119B-2603 with --tool-call-parser mistral and --reasoning-parser mistral, so existing tool-calling pipelines can expose the model's agentic and hybrid reasoning modes without custom glue, per LMSYS's post. In the same announcement, LMSYS also claims “3× more RPS vs Mistral Small 3,” framing the release as a throughput play as much as a capability upgrade.
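Because SGLang serves an OpenAI-compatible endpoint, a server launched with those flags should be queryable with stock clients. The sketch below assumes SGLang's default port (30000) and uses a hypothetical get_weather tool purely to exercise the Mistral tool-call parser; it is not taken from LMSYS's post.

```python
# A hedged sketch: querying an SGLang server launched with the flags the post
# describes. Assumes SGLang's default OpenAI-compatible endpoint on port 30000;
# the get_weather tool schema is illustrative, not from the release.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration only
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="mistralai/Mistral-Small-4-119B-2603",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,  # parsed server-side by --tool-call-parser mistral
)
print(resp.choices[0].message.tool_calls)
```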

vLLM also added day-one support, with its launch note calling out MLA attention, tool calling, and configurable reasoning mode, verified on NVIDIA GPUs. The example container config exposes the operational knobs engineers actually care about for rollout: a 262,144-token max model length, the Flash Attention MLA backend, tensor parallel size 2, automatic tool choice, and batching settings up to 16,384 tokens and 128 sequences.
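The same knobs map onto vLLM's offline Python API for a quick local smoke test. This is a hedged sketch, not the official container config: the FLASHMLA backend selector and the model ID (reused from the SGLang command above) are assumptions to verify against your vLLM version.

```python
# A minimal sketch mapping the knobs from vLLM's launch note onto the offline
# Python API. The backend name and model ID are assumptions; check your
# vLLM version's docs before relying on either.
import os

# Assumed selector for the Flash Attention MLA backend mentioned in the note.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHMLA"

from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-Small-4-119B-2603",  # ID from the SGLang command above
    max_model_len=262144,          # the 256K context window
    tensor_parallel_size=2,        # shard across 2 GPUs
    max_num_batched_tokens=16384,  # batching knobs from the launch note
    max_num_seqs=128,
)

out = llm.generate(
    ["Summarize MLA attention in one sentence."],
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```

Note that automatic tool choice is a serving-frontend option rather than an engine argument, so it does not appear in this offline sketch.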

Where Mistral says it improves

Mistral is positioning Small 4 as a consolidation release, not just a smaller checkpoint. The comparison chart in the launch materials shows separate instruct and reasoning scores for the same model, with reasoning mode lifting GPQA Diamond from 59.1 to 71.2, MMLU-Pro from 73.5 to 78.0, IFBench from 35.7 to 48.0, and MMMU-Pro from 46.3 to 60.0; Arena Hard improves more modestly, from 55.8 to 58.3, as a reposted chart corroborates.

That launch landed alongside Mistral's new NVIDIA partnership, which the announcement thread framed as co-developing “frontier open-source AI models.” In practice, that gives Small 4 more than a model-card moment: it shipped with immediate availability in Mistral Playground via the model picker and immediate support in two popular open serving stacks, which is the part most likely to matter for engineering teams evaluating it this week.
