AI Primer
release

xAI launches Grok Text-to-Speech API with 5 voices and emotion tags

xAI opened a Grok TTS API with five voices, inline controls for laughter and whispering, and multilingual streaming, and integrations landed quickly in LiveKit and fal. It is worth a look for voice products that need real-time playback, telephony-grade output formats, and hosted integration paths from day one.


TL;DR

  • xAI has opened a Grok TTS API with five built-in voices and inline controls for delivery, with the launch demo showing Eve, Ara, Leo, Rex, and Sal and a supporting thread describing tags for laughs, whispers, sighs, pauses, and emphasis.
  • The initial rollout is aimed at real-time product use: LiveKit's integration post calls out low-latency streaming, 20+ languages, and telephony-ready output, and the supporting thread adds output sample rates from 8 kHz to 48 kHz.
  • The API landed quickly in existing voice stacks: LiveKit added it to LiveKit Inference with "one API key" and no extra setup, and fal's launch exposed hosted access with WebSocket streaming and posted pricing of $0.0042 per 1K characters.

What shipped in Grok TTS?

The first public release is straightforward: five named voices, multilingual synthesis, and inline expressive controls rather than a separate prosody system. In the xAI demo, the API is presented as "Grok Text-to-Speech API" with Eve, Ara, Leo, Rex, and Sal. A more detailed supporting thread says a single POST call can trigger "laughs, whispers, and sighs on command," with tags for pauses and emphasis as well.
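To make the inline-control idea concrete, here is a minimal sketch of composing a script with delivery cues embedded in the text. The voice names and cue types come from the launch material, but the bracket syntax and the request field names (`voice`, `text`) are placeholders assumed for illustration, not xAI's documented format. Check the official API reference before sending anything.

```python
# Hypothetical sketch: the "[name]" tag syntax and the request field names
# are assumptions for illustration, not xAI's documented API.
VOICES = {"Eve", "Ara", "Leo", "Rex", "Sal"}               # from the launch demo
CUES = {"laugh", "whisper", "sigh", "pause", "emphasis"}   # from the supporting thread


def cue(name: str, template: str = "[{name}]") -> str:
    """Render one inline delivery cue using a caller-supplied template."""
    if name not in CUES:
        raise ValueError(f"unknown cue: {name}")
    return template.format(name=name)


def build_request(text: str, voice: str = "Eve") -> dict:
    """Assemble a request body for a single POST call (field names assumed)."""
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice}")
    return {"voice": voice, "text": text}


script = f"That went well. {cue('laugh')} {cue('pause')} Now for the hard part."
body = build_request(script, voice="Ara")
print(body)
```

Passing the template as a parameter keeps the placeholder syntax in one place, so it can be swapped for the real tag format without touching the rest of the code.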

That thread also adds the implementation details engineers usually look for first: auto-detection across 20+ languages and output formats spanning telephony-grade 8 kHz through 48 kHz audio. The same post claims the voice stack was built in-house, including VAD, tokenizer, and audio models, which matters mostly as a signal that xAI is shipping a full speech stack rather than a thin wrapper.
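For a rough sense of what the two ends of that sample-rate range mean for streaming bandwidth, the sketch below computes raw data rates. It assumes uncompressed 16-bit mono linear PCM, which the source does not specify. Telephony links often use 8-bit companded audio instead, and compressed codecs would be far smaller.

```python
def pcm_bytes_per_second(sample_rate_hz: int,
                         bits_per_sample: int = 16,
                         channels: int = 1) -> int:
    """Raw data rate of uncompressed linear PCM audio (encoding is an assumption)."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


for rate in (8_000, 48_000):
    bytes_per_s = pcm_bytes_per_second(rate)
    kbit_per_s = bytes_per_s * 8 / 1000
    print(f"{rate} Hz -> {bytes_per_s} bytes/s ({kbit_per_s:.0f} kbit/s)")
# 8000 Hz -> 16000 bytes/s (128 kbit/s)
# 48000 Hz -> 96000 bytes/s (768 kbit/s)
```

Under these assumptions the 48 kHz stream carries six times the data of the 8 kHz one, which is why the lower rate remains the default for phone-grade paths.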

Where can you use it already?

The fastest ecosystem pickup came from LiveKit. In its integration announcement, LiveKit says Grok TTS is available inside LiveKit Inference with "natural, expressive voices," low-latency streaming, and 20+ languages. The linked plugin guide shows two paths: through LiveKit Inference, or directly against xAI via the livekit-agents[xai] plugin and an API key.

fal shipped a hosted endpoint at nearly the same time. According to fal's post, the service includes real-time WebSocket streaming, the same five-voice setup with inline emotion tags, and published pricing at $0.0042 per 1K characters. That gives teams at least two off-the-shelf integration routes on day one: agent-oriented voice sessions through LiveKit and direct hosted inference through fal.
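The posted fal price makes back-of-the-envelope budgeting easy. The sketch below assumes billing scales linearly with character count, with no minimum charge, rounding rule, or volume tier. The article gives none of those details, so treat the output as an estimate only.

```python
from decimal import Decimal

# USD per 1,000 characters, from fal's posted pricing.
PRICE_PER_1K_CHARS = Decimal("0.0042")


def estimate_cost_usd(char_count: int) -> Decimal:
    """Linear cost estimate; ignores any minimums or tiers (assumption)."""
    return PRICE_PER_1K_CHARS * Decimal(char_count) / Decimal(1000)


for chars in (1_000, 100_000, 1_000_000):
    print(f"{chars:>9,} chars -> ${estimate_cost_usd(chars):.4f}")
```

At this rate, one million characters of synthesized text works out to about $4.20.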
