Gemini 3.1 Flash TTS adds Audio Tags, 70-language support, and SynthID
Gemini 3.1 Flash TTS adds Audio Tags, support for 70-plus languages, and SynthID watermarking for generated speech. The preview spans the Gemini API, AI Studio, Vertex AI, and Google Vids, so teams can test the new delivery controls before adopting them.

TL;DR
- Google says Gemini 3.1 Flash TTS is its most controllable speech model yet, adding Audio Tags so prompts can steer tone, pacing, and delivery directly inside the script.
- According to Google DeepMind's launch thread and the official blog post, the model supports 70+ languages and watermarks every output with SynthID.
- Logan Kilpatrick's demo and Google's Cloud prompting guide position the upgrade around scene direction, speaker-level control, and more expressive voices.
- Google DeepMind's rollout post says the preview spans the Gemini API, Google AI Studio, Vertex AI, and Google Vids, while the Gemini-TTS docs list the preview model as gemini-3.1-flash-tts-preview.
You can skim the official announcement, steal prompt patterns from Google's Cloud guide, and open the Gemini-TTS docs if you want the exact model ID, output formats, and speaker settings.
Audio Tags
The useful shift is that Google is framing direction as text, not post-processing knobs. In the launch demo, Google DeepMind shows tags and natural-language instructions changing energy and pace mid-line.
Google's Cloud prompting guide says the model supports 200+ audio tags, with a simple pattern: pacing tag, spoken text, expressive tag, spoken text. The same guide lists tags like [whispers], [laughs], [short pause], [fast], and [nervousness], and notes that the tags stay in English even when the spoken text switches languages.
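The pattern the guide describes is easy to assemble programmatically. A minimal sketch in Python: the helper name and the example sentences are ours, and only the bracketed tags come from Google's Cloud guide.

```python
def tagged_line(pacing_tag: str, opening: str, expressive_tag: str, rest: str) -> str:
    """Build one scripted line following the guide's pattern:
    pacing tag, spoken text, expressive tag, spoken text.
    Tags stay in English even when the spoken text switches languages."""
    return f"[{pacing_tag}] {opening} [{expressive_tag}] {rest}"

# Tags listed in Google's Cloud prompting guide include:
# [whispers], [laughs], [short pause], [fast], [nervousness].
line = tagged_line("fast", "Okay, here's the plan.", "whispers", "Don't tell anyone yet.")
print(line)
# [fast] Okay, here's the plan. [whispers] Don't tell anyone yet.
```

Because the tags are plain text inside the prompt, the same helper works whether the script ends up in AI Studio, the Gemini API, or Vertex AI.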
70+ languages, plus a quality jump
Google's launch materials keep pairing control with range: 70+ languages, regional variants, and more natural speech. The chart attached to Google DeepMind's thread places Gemini 3.1 Flash TTS at 1211 on Artificial Analysis' TTS Arena, just behind Inworld TTS 1.5 Max at 1215.
That does not settle the leaderboard, but the margin is small enough that the more interesting part is the workflow change: prompts now read more like stage direction than SSML.
Preview surfaces
The rollout is split by audience. According to Google DeepMind, developers get preview access through the Gemini API and Google AI Studio, enterprises get Vertex AI preview, and Google Vids is getting the model for end-user creation workflows.
The Gemini-TTS documentation adds a few practical details missing from the tweets: the preview model ID is gemini-3.1-flash-tts-preview, the docs describe both single-speaker and multi-speaker dialogue generation, and Google's Cloud guide says creators start from 30 prebuilt voices before layering style instructions and inline tags.
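For the multi-speaker case, one lightweight way to keep a dialogue script tidy before sending it as a prompt is to assemble speaker-labelled lines with the audio tags inline. A hedged sketch: the `Speaker: line` layout and the names here are our assumptions for illustration, not a documented wire format from the Gemini-TTS docs.

```python
def dialogue_script(turns: list[tuple[str, str]]) -> str:
    """Join (speaker, line) pairs into a speaker-labelled script.
    Inline audio tags such as [short pause] ride along inside each line."""
    return "\n".join(f"{speaker}: {line}" for speaker, line in turns)

script = dialogue_script([
    ("Host", "[fast] Welcome back to the show."),
    ("Guest", "[laughs] Glad to be here. [short pause] Let's get into it."),
])
print(script)
# Host: [fast] Welcome back to the show.
# Guest: [laughs] Glad to be here. [short pause] Let's get into it.
```

From there, the docs' flow is: pick one of the 30 prebuilt voices per speaker, add a style instruction up front, and let the inline tags handle moment-to-moment delivery.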