AI Primer

Grok launches STT and TTS APIs with WebSocket streaming and 25-plus languages

Grok added standalone speech-to-text and text-to-speech APIs with WebSocket streaming, word timestamps, diarization, and support for 25-plus languages. Developers building realtime audio apps can now call Grok Voice infrastructure directly instead of wiring it through the app UI.


TL;DR

  • xAI's reposted announcement confirms Grok Speech to Text is now live, while benjitaylor's launch post says the new APIs expose the same Grok Voice stack already used in Tesla vehicles and Starlink support.
  • According to ai_for_success's feature rundown, the STT API supports real-time WebSocket streaming, partial transcripts about every 500 ms, word-level timestamps, speaker diarization, and 25-plus languages.
  • In the same ai_for_success thread, Grok TTS ships with five voices, prosody tags such as [pause] and <whisper>, WebSocket or REST delivery, and output formats including MP3, WAV, PCM, μ-law, and A-law.
  • Pricing, per ai_for_success's post, is unusually aggressive: STT is listed at $0.10 per hour for batch transcription and $0.20 per hour for streaming, while TTS is listed at $4.20 per 1M characters.

You can jump straight to the open-source demo repo, skim benjitaylor's launch note for the stack provenance, and use ai_for_success's thread as the quickest feature inventory. The interesting part is how much of the realtime audio plumbing is exposed directly: streaming endpoints, diarization, inverse text normalization, expressive tags, and server-side key proxying all show up immediately.

What shipped

The launch looks like a clean unbundling of Grok Voice infrastructure into standalone developer APIs. benjitaylor's post quotes xAI describing them as new Grok Voice APIs, and his follow-up says they run on the same speech stack used in Tesla vehicles and Starlink customer support.

The evidence pool names two surfaces: speech to text and text to speech. ai_for_success's thread presents both as directly callable APIs rather than app-only voice features, with WebSocket support on each side.

Grok STT

The STT side is built for low-latency transcription rather than offline batch cleanup. According to ai_for_success's thread, it returns partial results roughly every 500 ms, then emits a final transcript at utterance end.

The same post lists the core mechanics:

  • Real-time streaming over WebSocket
  • Word-level timestamps
  • Speaker diarization
  • Inverse text normalization, including spoken numbers and currency
  • Support for 25-plus languages
  • Phone-call, meeting, podcast, and telephony use cases
  • Batch pricing at $0.10 per hour, streaming pricing at $0.20 per hour

The GitHub demo summary in the repo link post adds a few implementation details that matter for browser apps: a native 16 kHz capture path, AudioWorklet-based resampling, optional endpointing controls, and utterance-final deduplication.
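The AudioWorklet resampling step runs as browser-side JavaScript in the demo, but the underlying operation is a plain rate conversion down to 16 kHz. A minimal sketch of that logic in Python — an assumption about what the worklet does, not a transcription of it (production resamplers also low-pass filter before decimating):

```python
def resample_linear(samples: list[float], src_rate: int,
                    dst_rate: int = 16_000) -> list[float]:
    """Downsample float PCM by linear interpolation, e.g. 48 kHz -> 16 kHz.

    A naive sketch: real resamplers apply an anti-aliasing low-pass
    filter first, which is omitted here for clarity.
    """
    if src_rate == dst_rate or not samples:
        return list(samples)
    ratio = src_rate / dst_rate
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two nearest source samples.
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out

# A 10 ms frame at 48 kHz (480 samples) becomes 160 samples at 16 kHz.
print(len(resample_linear([0.0] * 480, 48_000)))  # 160
```

The native 16 kHz capture path the repo mentions would skip this step entirely when the device supports it.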

Grok TTS

The TTS side is more expressive than the usual single-voice REST wrapper. In ai_for_success's thread, the API is described as shipping with five voices, 20-plus languages, streaming over WebSocket or unary REST, and no text length limit on the WebSocket path.

That same source lists the control surface as a set of inline prosody tags, including:

  • [pause]
  • [laugh]
  • [sigh]

Output options in the feature rundown include MP3, WAV, PCM, μ-law, and A-law. Pricing is listed there at $4.20 per 1M characters.
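The μ-law and A-law options matter mostly for telephony, where 8-bit companded audio is the wire format. A sketch of standard G.711 μ-law companding to show what that output option implies (this is the textbook algorithm, not xAI's code), with the listed TTS rate applied per character:

```python
MULAW_BIAS = 0x84   # 132, standard G.711 encoder bias
MULAW_CLIP = 32635

def linear_to_mulaw(sample: int) -> int:
    """Compand one 16-bit signed PCM sample into an 8-bit G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), MULAW_CLIP) + MULAW_BIAS
    # Find the segment (exponent) of the biased magnitude.
    exponent, mask = 7, 0x4000
    while exponent > 0 and not (magnitude & mask):
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF  # G.711 inverts the bits

print(hex(linear_to_mulaw(0)))      # 0xff (silence)
print(hex(linear_to_mulaw(32767)))  # 0x80 (max positive)

TTS_PER_MILLION_CHARS = 4.20  # USD, per ai_for_success's post

def tts_cost(text: str) -> float:
    """Dollar cost of synthesizing `text` at the listed rate."""
    return len(text) * TTS_PER_MILLION_CHARS / 1_000_000
```

At $4.20 per million characters, synthesizing a full novel-length text (roughly 500,000 characters) costs about $2.10.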

The demo repo exposes the browser architecture


The open-source playground answers the practical question the launch posts leave open: how xAI expects developers to wire this into a web app. The summary attached to the repo link post says the app uses a Node.js backend as a WebSocket proxy so API keys stay server-side, then streams STT and TTS to a browser UI with live mic capture, audio playback, and realtime status indicators.

That repo summary also surfaces a detail the launch posts do not spell out: the five TTS voices are named eve, ara, rex, sal, and leo. It is a small thing, but it turns the launch from a generic API drop into a workable reference implementation for browser-based audio apps.
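The key-proxy pattern is worth pausing on: the browser never sees the xAI key, because the backend terminates the client WebSocket and opens its own authenticated upstream connection. A minimal sketch of the header-handling piece, in Python rather than the demo's Node.js — the `XAI_API_KEY` variable name and the bearer-token scheme are assumptions for illustration, not documented API details:

```python
import os

def build_upstream_headers(client_headers: dict[str, str]) -> dict[str, str]:
    """Strip any credentials the browser sent, then attach the server-held key.

    The proxy relays audio frames in both directions but never forwards
    the key downstream; XAI_API_KEY and the Bearer scheme are assumptions.
    """
    upstream = {
        k: v for k, v in client_headers.items()
        if k.lower() not in ("authorization", "x-api-key", "cookie")
    }
    upstream["Authorization"] = f"Bearer {os.environ['XAI_API_KEY']}"
    return upstream
```

A client-supplied `Authorization` header is deliberately discarded rather than forwarded, so a compromised or curious browser cannot smuggle its own credentials upstream.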
