Realtime TTS-2 releases with sub-200 ms TTFA and 100+ languages
Realtime TTS-2 ships as a low-latency speech model that conditions on prior audio turns, not just text, and claims sub-200 ms time-to-first-audio across 100+ languages. The release matters for voice-agent stacks because Replicate and LiveKit are already exposing it for real-time integration work.

TL;DR
- Inworld says Realtime TTS-2 conditions on the actual audio from prior turns, not just a transcript, and uses that context to carry tone, pacing, and emotional state forward in live conversations, per testingcatalog's launch summary and Inworld's launch post.
- The launch claims three headline specs at once: one voice identity across 100+ languages, sub-200 ms time-to-first-audio, and natural-language voice direction instead of preset emotion tags, according to testingcatalog's feature list and Replicate's model page.
- Distribution showed up immediately: livekit's post said the model is available on LiveKit Inference, while replicate's post pointed developers to a hosted Replicate endpoint.
- Early reaction fixated on latency more than expressiveness. kimmonismus argued that anything above roughly 300 ms is noticeable in a voice agent, which makes Inworld's sub-200 ms claim the number to watch.
You can read Inworld's official announcement, try the Replicate model page, and follow LiveKit's Inworld TTS docs to wire it into an agent pipeline. The Replicate page is also where the launch gets more concrete about bracketed steering cues like "[say excitedly]" and the split between 15 production languages and 90+ experimental ones.
Audio context
The architectural claim is simple and useful: TTS-2 takes prior audio turns as input, so it can react to how someone sounded, not only to what their transcript said. Inworld's launch post says the model tracks tone, pacing, and emotional state across the exchange, while livekit described the same behavior as mirroring users' tone, pacing, and emotional range.
That is the part that separates this launch from a standard "more expressive TTS" release. Inworld's post frames it as a conversation-first model rather than a narration model, and TestingCatalog's writeup says it was rebuilt around conversational speech instead of scripted voiceover.
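To make the claim concrete, here is a purely illustrative sketch of what a conditioning request could look like. The endpoint and every field name below are hypothetical, not Inworld's documented API; the only point is the shape, which carries prior turns as audio rather than as transcripts.

```python
# Hypothetical request shape only: the endpoint and field names are NOT
# Inworld's documented API. It illustrates conditioning on prior audio turns.
import base64

def load_b64(path: str) -> str:
    """Read a short audio clip and base64-encode it for the payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

payload = {
    "text": "Sure, I can move that meeting for you.",
    "voice_id": "example-voice",  # hypothetical voice identifier
    "context_audio": [            # prior turns as audio, oldest first
        {"role": "assistant", "audio_b64": load_b64("agent_turn_1.wav")},
        {"role": "user", "audio_b64": load_b64("caller_turn_1.wav")},
    ],
}
# Send `payload` with any HTTP client, e.g. requests.post(<tts_endpoint>, json=payload).
```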
Steering and language coverage
The control surface breaks into three pieces:
- Natural-language steering, with free-form bracketed cues such as "[say excitedly]" or "[whisper in a hushed style]," according to Replicate's model page.
- Crosslingual voice consistency, with one claimed voice identity across 100+ languages, per testingcatalog's feature list and Inworld's launch post.
- A more specific language breakdown on Replicate: 15 production languages, plus 90+ experimental languages, according to Replicate's README.
That last split is easy to miss in the broader launch framing. The headline says 100+ languages, but the hosted model page is the one that distinguishes mature coverage from experimental coverage.
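As a rough illustration of that control surface, a call through the hosted Replicate endpoint might look like the sketch below. The model slug and input field names are assumptions; the Replicate model page is the source of truth for the exact schema.

```python
# Minimal sketch with the Replicate Python client (pip install replicate,
# REPLICATE_API_TOKEN set). The model slug and input fields are assumptions;
# check the actual Replicate model page for the real schema.
import replicate

output = replicate.run(
    "inworld/realtime-tts-2",  # placeholder slug; confirm on replicate.com
    input={
        # Free-form bracketed steering cue inline with the text, in the
        # "[say excitedly]" style the Replicate README describes.
        "text": "[whisper in a hushed style] The launch is tomorrow.",
        "language": "es",  # assumed parameter name and value
    },
)
print(output)  # typically a URL or file handle for the generated audio
```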
Latency
The launch's hardest number is sub-200 ms time-to-first-audio. testingcatalog's summary and replicate's post both surfaced it as a top-line spec, and kimmonismus called it the metric that matters because lag becomes perceptible above roughly 300 ms in voice agents.
That framing lines up with how other realtime voice stacks talk about the problem. In OpenAI's infrastructure writeup, the company describes low and stable media round-trip time as core to natural turn-taking, with routing and session termination choices driven by latency.
TTS-2 is still only one piece of that budget, but it is the piece developers feel immediately when the model starts speaking too late.
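Whether the sub-200 ms figure holds up in a given stack is easy to sanity-check: time from sending the request to receiving the first streamed audio bytes. The endpoint and payload below are placeholders, and a client-side measurement folds in your own network path, so treat the result as an upper bound on the model's TTFA.

```python
# Client-side TTFA check: time until the first audio bytes arrive.
# The streaming endpoint and payload are placeholders, not a real API.
import time
import requests

start = time.monotonic()
with requests.post(
    "https://api.example.com/v1/tts/stream",  # hypothetical streaming endpoint
    json={"text": "Latency check, one two three."},
    stream=True,
    timeout=30,
) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:  # first non-empty chunk = first audio on the wire
            ttfa_ms = (time.monotonic() - start) * 1000
            print(f"TTFA: {ttfa_ms:.0f} ms")
            break
```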
Distribution
This shipped straight into developer surfaces rather than living only as an Inworld demo. livekit linked to LiveKit Inference on launch day, replicate published a hosted model page the same day, and Inworld's announcement says the model is available as a research preview through the Inworld API and Inworld Realtime API.
A secondary integration map shows up in TestingCatalog's writeup, which lists Layercode, LiveKit, NLX, Pipecat, Vapi, and Voximplant as current integrations. For a voice-model launch this small, that is Christmas come early for agent-stack builders: the interesting part is not just the model card, but how fast it landed in the plumbing people already use.
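For the LiveKit path specifically, the slot it fills is the TTS stage of an agent session. The sketch below uses LiveKit Agents' standard plugin pattern; the `inworld` plugin module and its constructor arguments are assumptions to verify against LiveKit's Inworld TTS docs.

```python
# Sketch of an agent pipeline with TTS-2 as the speech stage. The overall
# AgentSession/plugin pattern is LiveKit Agents'; the `inworld` module name
# and TTS arguments are assumptions, so follow LiveKit's Inworld TTS docs.
from livekit.agents import AgentSession
from livekit.plugins import deepgram, openai, inworld  # assumed plugins installed

session = AgentSession(
    stt=deepgram.STT(),  # any STT plugin slots in here
    llm=openai.LLM(),
    tts=inworld.TTS(),   # hypothetical; constructor args omitted
)
```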