releaseMay 21, 2026

ElevenLabs claims Speech Engine adds 70-plus voice languages to agents

A sponsored explainer thread described Speech Engine as a WebSocket layer that adds speech-to-text, turn detection, interruption handling, and text-to-speech to existing LLM agents. The pitch is that teams can keep their current model stack and add voice without rebuilding the whole agent.

3 min read

ElevenLabs claims Speech Engine adds 70-plus voice languages to agents

TL;DR

ElevenLabs is pitching Speech Engine as a voice layer for existing agents, and minchoi's sponsored thread says it works with OpenAI, Claude, Gemini, or a custom stack across 70-plus voice languages.
According to minchoi's feature list, the layer bundles speech-to-text, text-to-speech, turn detection, interruption handling, voice activity detection, audio orchestration, and streaming responses, plus support for 90-plus transcription languages.
minchoi's flow diagram frames the handoff simply: ElevenLabs handles transcription and playback, while your server still owns the transcript history and the LLM response.
Setup starts with npx skills add elevenlabs/skills --skill speech-engine, then a WebSocket URL, as shown in minchoi's setup post.

You can open ElevenLabs' Speech Engine page, skim minchoi's feature inventory, and the cleanest product distinction is in minchoi's Speech Engine versus ElevenAgents post. The whole pitch is pretty direct: keep your current agent brain, swap in a voice front end, and keep the server-side control path.

Voice stack

The thread's most useful detail is the actual boundary of the product. Speech Engine is not just text-to-speech.

It is described as a bundled real-time voice stack with:

speech-to-text
text-to-speech
turn detection
interruption handling
voice activity detection
audio orchestration
streaming responses
70-plus voice languages
90-plus transcription languages

A second post adds the runtime behavior ElevenLabs wants developers to care about: natural turn-taking, mid-sentence interruptions, low-latency transcription, WebSocket conversation sessions, browser and mobile voice sessions, and compatibility with existing LLM, RAG, and tool setups.

WebSocket handoff

The simplified architecture is the selling point. According to minchoi's flow diagram, the user speaks, ElevenLabs transcribes, your server receives the transcript plus history, your LLM generates the answer, and ElevenLabs speaks it back.

That division keeps the model layer outside ElevenLabs' control path. minchoi's summary post states the pitch even more plainly: you keep the brain, ElevenLabs handles the voice.

The implementation detail surfaced in the thread is a WebSocket endpoint. minchoi's setup post shows the install command and a speechEngine.create() call that points to wss://your-server.com/ws, which suggests Speech Engine expects teams to plug their existing backend into a live audio session rather than rebuild the agent around a new hosted runtime.

Product split

The last useful distinction is product scope. minchoi's Speech Engine versus ElevenAgents post says Speech Engine is for teams that already have a chat agent, want to keep their own LLM, use custom RAG or tools, and need server-side control.

The same post says ElevenAgents is the more managed option, where ElevenLabs handles the LLM, tools, knowledge base, and deployment. That makes the launch less about a new voice bot and more about a thinner integration layer for people who already built the bot they want.

TL;DR

Voice stack

WebSocket handoff

Product split

Discussion across the web