Voxtral TTS uses separate semantic and acoustic token models, a 2.14 kbps codec, and 3-25 second reference audio for cloning across nine languages. Try it if you want a hybrid speech pipeline with more control and faster acoustic synthesis than all-autoregressive generation.

You can read the official launch post, jump straight to the model page, and skim the API docs for the oddest detail: Mistral treats the reference voice as an instruction signal, so rhythm and emotion come from the prompt clip without extra prosody tags. The paper makes the same point more mechanically, with a split pipeline, a tiny codec, and flow matching for the acoustic side.
Mistral’s main design choice is to separate what is being said from how it sounds. That is the whole story here, and it is a smart break from all-in-one autoregressive speech stacks.
According to the paper and launch materials, Voxtral uses a decoder-only autoregressive transformer for semantic tokens, then a flow-matching model for acoustic tokens. The split gives the first stage the long-range coherence needed for linguistic structure, while the second stage handles prosody and timbre with a generator better suited to continuous audio.
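The control flow of that split can be sketched in a few lines. This is a toy illustration only: `toy_next_token`, `toy_acoustic_model`, and `TOY_VOCAB_SIZE` are hypothetical stand-ins, not Mistral's models or API, and the acoustic stub emits seeded noise instead of running flow matching. The only figure taken from the source is 36 acoustic values per frame.

```python
import random
from typing import Callable, List

ACOUSTIC_VALUES_PER_FRAME = 36  # from the codec description in the thread
TOY_VOCAB_SIZE = 1000           # hypothetical; not a figure from the paper


def generate_semantic(next_token: Callable[[List[int]], int],
                      prompt: List[int], n_frames: int) -> List[int]:
    """Stage 1: autoregressive loop. Each new token sees all previous ones."""
    history = list(prompt)
    for _ in range(n_frames):
        history.append(next_token(history))
    return history[len(prompt):]


def generate_acoustic(acoustic_model: Callable[[int], List[float]],
                      semantic_tokens: List[int]) -> List[List[float]]:
    """Stage 2: produce one acoustic frame per semantic token."""
    return [acoustic_model(token) for token in semantic_tokens]


def toy_next_token(history: List[int]) -> int:
    """Stand-in for the decoder-only transformer (deterministic toy rule)."""
    return (sum(history) + len(history)) % TOY_VOCAB_SIZE


def toy_acoustic_model(token: int) -> List[float]:
    """Stand-in for the flow-matching model (seeded noise, not flow matching)."""
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(ACOUSTIC_VALUES_PER_FRAME)]


semantic = generate_semantic(toy_next_token, prompt=[1, 2, 3], n_frames=5)
frames = generate_acoustic(toy_acoustic_model, semantic)
print(len(semantic), "semantic tokens ->", len(frames), "acoustic frames")
```

The point of the sketch is the shape: only the first stage has a token-by-token dependency on its own history, while the second stage maps each frame's semantic context to acoustic detail.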
The codec numbers are compact enough to matter on their own.
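The thread and paper describe Voxtral Codec as 1 semantic token plus 36 acoustic tokens per frame, at a 12.5 Hz frame rate, for roughly 2.14 kbps in total.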
That tokenization scheme sits underneath the rest of the model. It is also the cleanest explanation for why Mistral can pair voice cloning with streaming, instead of treating high-quality synthesis and low latency as opposing goals.
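A quick back-of-the-envelope check shows what the thread's figures (1 semantic token, 36 acoustic tokens, 12.5 Hz, about 2.14 kbps) imply. The sketch assumes the token counts are per frame and treats "~2.14 kbps" as exactly 2140 bits per second; the rounding rule in `frames_for_clip` is also an assumption, not something the paper specifies.

```python
import math

FRAME_RATE_HZ = 12.5
SEMANTIC_TOKENS_PER_FRAME = 1
ACOUSTIC_TOKENS_PER_FRAME = 36
BITRATE_BPS = 2140.0  # "~2.14 kbps", treated as exact for this estimate

tokens_per_frame = SEMANTIC_TOKENS_PER_FRAME + ACOUSTIC_TOKENS_PER_FRAME
tokens_per_second = tokens_per_frame * FRAME_RATE_HZ
bits_per_frame = BITRATE_BPS / FRAME_RATE_HZ
avg_bits_per_token = bits_per_frame / tokens_per_frame


def frames_for_clip(seconds: float) -> int:
    """Frames needed to cover a clip, rounding up (assumed rounding rule)."""
    return math.ceil(seconds * FRAME_RATE_HZ)


print(f"tokens per second: {tokens_per_second}")
print(f"bits per frame: {bits_per_frame:.1f}")
print(f"average bits per token: {avg_bits_per_token:.2f}")

# The shortest and longest reference clips mentioned for cloning.
for seconds in (3, 25):
    n_frames = frames_for_clip(seconds)
    print(f"{seconds} s clip: {n_frames} frames, {n_frames * tokens_per_frame} tokens")
```

Under these assumptions each frame carries about 171 bits, or between 4 and 5 bits per token on average, which is why a full reference clip stays small enough to use as a prompt.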
The training recipe mirrors the architecture split.
The semantic side is trained with ASR distillation so the token stream tracks real linguistic structure. The acoustic side is modeled continuously with flow matching, and the thread says it can produce high-quality audio in about eight steps.
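To make "about eight steps" concrete, here is a minimal sketch of flow-matching sampling as fixed-step Euler integration of a velocity field from t = 0 to t = 1. It is not Voxtral's sampler: in a real system the velocity comes from a trained network conditioned on the semantic context, whereas the hypothetical `velocity` function below is the closed-form field for a straight path to a known `target`, chosen so the result is easy to verify.

```python
import random
from typing import List


def velocity(x: List[float], t: float, target: List[float]) -> List[float]:
    """Toy velocity field for a straight-line path that ends at `target`."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def euler_sample(x0: List[float], target: List[float],
                 n_steps: int = 8) -> List[float]:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # the last step starts at t = 1 - dt, so 1 - t is never zero
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(36)]    # starting point: Gaussian noise
target = [rng.uniform(-1.0, 1.0) for _ in range(36)]  # stand-in for one acoustic frame

sample = euler_sample(noise, target, n_steps=8)
max_error = max(abs(s - g) for s, g in zip(sample, target))
print(f"max error after 8 steps: {max_error:.2e}")
```

Because the toy path is a straight line, Euler integration lands on the target regardless of step count. A learned field is only approximately straight, so the step count becomes a quality versus latency trade-off.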
Mistral also extends DPO across both halves of the system, with a discrete objective for semantic tokens and a continuous objective for acoustic tokens.
That is a more interesting detail than the headline benchmark. It suggests Mistral is treating preference optimization as a multimodal control layer, not just a text-model fine-tuning trick.
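For reference, the standard discrete DPO loss, as used for text models, looks like the sketch below. It is the generic formulation, not Voxtral's exact objective: the paper's continuous variant for the flow-matching side is not reproduced here, and `beta=0.1` is only an illustrative default.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(margin)).

    Inputs are summed token log-probabilities of the chosen and rejected
    sequences under the trained policy and a frozen reference model.
    """
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# No preference learned yet: the policy equals the reference.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))

# The policy has raised the chosen sequence and lowered the rejected one.
print(dpo_loss(-5.0, -9.0, -7.0, -7.0))
```

The loss falls as the policy raises the chosen sample's likelihood relative to the reference and lowers the rejected one's. A continuous objective has to replace those sequence log-probabilities with a quantity defined for the flow-matching model.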
The docs are more concrete than the tweet thread about how Voxtral actually ships.
The API docs say Voxtral supports English, French, Spanish, Portuguese, Italian, Dutch, German, Hindi, and Arabic, plus cross-lingual voice cloning and code-mixing. They also give real latency numbers: about 90 ms model processing time, around 0.8 seconds end-to-end time-to-first-audio for PCM, and about 3 seconds for MP3.
The Hugging Face model card adds the deployment picture: 24 kHz output in WAV, PCM, FLAC, MP3, AAC, and Opus, 20 preset voices, BF16 weights, and vLLM-Omni support on a single GPU with at least 16 GB of memory. The release is under CC BY-NC 4.0, inherited from the reference-voice data used to ship the model and sample voices.
1. The key design choice
Voxtral doesn't model everything autoregressively, it splits the problem:
- A decoder-only autoregressive transformer generates semantic tokens
- A flow-matching model generates acoustic tokens
This combines long-range coherence with rich acoustic…

2. Factorized semantic + acoustic tokens are built on top of Voxtral Codec, which compresses speech into ultra-low bitrate tokens:
• 1 semantic token
• 36 acoustic tokens
• 12.5 Hz frame rate
• ~2.14 kbps total

3. Training is also hybrid.
- Semantic tokens are trained with ASR distillation to match real linguistic structure
- Acoustic tokens are modeled continuously via flow matching, producing high-quality audio in ~8 steps

4. Voxtral even extends preference optimization (DPO) to this setup:
• discrete objective for semantic tokens
• continuous objective for acoustic tokens

So the lesson here is to treat speech as two fundamentally different problems and use the right model for each.
Paper: arxiv.org/abs/2603.25551
Model weights: huggingface.co/mistralai/Voxt…