AI Primer
TOPIC · 28 stories

Voice Agents

Realtime spoken conversational agents and voice workflows.

RELEASE · 23rd June
AssemblyAI launches Universal-3.5 Pro Realtime with Context Carryover

AssemblyAI’s Universal-3.5 Pro Realtime now carries forward the agent side of a conversation to improve live transcription. The release also ships multilingual realtime ASR features, and one early deployment reported that critical-utterance errors fell from 26% to 9%.
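To put the reported drop in perspective, the sketch below separates the absolute change from the relative one. It uses only the two figures quoted above; the function name `relative_reduction` is a hypothetical helper, and how the deployment defined a "critical utterance" is not stated in the source.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate that was eliminated."""
    if before <= 0:
        raise ValueError("before must be a positive rate")
    return (before - after) / before


# Figures quoted in the story: 26% before, 9% after.
before, after = 0.26, 0.09

print(f"absolute drop: {(before - after) * 100:.0f} percentage points")
print(f"relative reduction: {relative_reduction(before, after):.1%}")
```

Read this way, a 17-point drop corresponds to roughly two thirds of the original errors being removed, which is the more useful number when comparing against a different baseline.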

RELEASE · 1w ago
LiveKit ships Turn Detector v1 with 14-language endpointing

LiveKit released Turn Detector v1 on Cloud and a smaller v1-mini bundled with its Agents SDKs for fast CPU inference. The model predicts end-of-turn directly from speech across 14 languages, changing interruption behavior and latency in voice agents.

RELEASE · 1w ago
Cartesia releases Sonic-3.5 and Ink-2 for streaming TTS and STT

Cartesia launched Sonic-3.5 for text-to-speech and Ink-2 for speech-to-text, calling them its new top streaming voice models. The release pairs low-latency voice-agent claims with 42-language support and immediate partner availability.

RELEASE · 2w ago
Zyphra releases ZONOS2: 8B sparse MoE TTS with zero-shot voice cloning

Zyphra released ZONOS2 under Apache 2.0 with 8B total parameters, 900M active, zero-shot voice cloning, 44.1 kHz DAC audio, and ZTTS1-Eval. The release includes open weights, inference code, and eval code, so teams can run real-time multilingual TTS without a hosted-only stack.

NEWS · 4w ago
Artificial Analysis launches AA-WER Streaming with Cartesia Ink-2 at 3.7% WER

Artificial Analysis launched AA-WER Streaming to benchmark streaming speech-to-text models on accuracy and latency for voice agents. The first leaderboard puts Cartesia Ink-2 and ElevenLabs Scribe v2 on the price-latency frontier, so teams should compare cost against latency before choosing a model.
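For readers new to the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that textbook definition; it is not the AA-WER Streaming methodology, whose text normalization and scoring rules are not described here. The function name and the lowercase, whitespace-split normalization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Textbook WER: (substitutions + deletions + insertions) / reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% on very noisy output, which is worth keeping in mind when reading leaderboard figures.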

RELEASE · 1mo ago
OpenClaw releases 2026.5.20 with Discord voice follow and secret warnings

OpenClaw 2026.5.20 adds Discord voice sessions that follow configured users, plus doctor checks for plaintext secrets in config files. The release also improves xAI headless login, clarifies model status, and fixes stuck Windows installs.

RELEASE · 1mo ago
ElevenLabs launches Speech Engine at 8¢ per minute for chat-to-voice agents

ElevenLabs launched Speech Engine, a layer that adds transcription, speech synthesis, turn-taking, and interruption handling on top of an existing chat agent. The release pairs SDKs, one-command setup, and 8¢-per-minute pricing for production voice agents.

RELEASE · 1mo ago
OpenClaw 2026.5.18 ships Grok OAuth, Android Talk Mode, and dialog-aware browser actions

OpenClaw 2026.5.18 shipped Grok OAuth and sidecar auth fixes, realtime Android Talk Mode, Telegram forum-topic delivery fixes, and better browser dialog handling. The release removes several auth and UI dead-ends that can stall long agent runs.

RELEASE · 1mo ago
Thinking Machines introduces interaction models with 200 ms full-duplex audio, video, and tool use

Thinking Machines previewed interaction models that process audio, video, and text in 200 ms micro-turns, letting the system listen, speak, and react at the same time. The demos matter because the interaction loop is trained into the model instead of stitched together from separate speech and tool layers.

NEWS · 1mo ago
Pi community ships `pi-listens`, `pi-kanban`, and `pi-codex-conversion` in one-day extension burst

Independent Pi builders shipped a voice layer, a kanban and observability dashboard, a Codex-conversion tool with `apply_patch`, and smaller UI extensions in the same window. The burst matters because it turns Pi from a single coding agent into a real local-first extension ecosystem with voice, review, and workflow primitives.

RELEASE · 1mo ago
OpenAI adds GPT-Realtime-2, Translate, and Whisper to the Realtime API

OpenAI added GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper to the Realtime API. The update gives voice agents live reasoning, translation, and transcription, but it remains API-only rather than part of ChatGPT voice mode.

NEWS · 1mo ago
ElevenLabs cuts Flash TTS 55%, Scribe 45%, and Agents 20% with pay-as-you-go billing

ElevenLabs lowered self-serve pricing for ElevenAPI and ElevenAgents and added pay-as-you-go billing. The biggest listed drops are to $0.05 per 1,000 tokens for Flash TTS, $0.22 for Scribe v2 speech-to-text, and $0.08 per minute for agent calls.
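As a rough sanity check, the headline percentages and the new prices together imply what the earlier prices would have been. The sketch below does that arithmetic; it assumes each headline cut (55%, 45%, 20%) applies to the corresponding listed price, which the story does not confirm. The helper name `implied_previous_price` is hypothetical, and the results are back-of-envelope inferences rather than published figures.

```python
def implied_previous_price(new_price: float, cut_fraction: float) -> float:
    """Price before a cut, given the price after it and the fractional cut."""
    if not 0 <= cut_fraction < 1:
        raise ValueError("cut_fraction must be in [0, 1)")
    return new_price / (1 - cut_fraction)


# (new price in USD, headline cut) as quoted in the story.
# Billing units are taken as listed; Scribe's unit is not stated there.
listed = {
    "Flash TTS": (0.05, 0.55),
    "Scribe v2": (0.22, 0.45),
    "Agents": (0.08, 0.20),
}

for product, (new_price, cut) in listed.items():
    old_price = implied_previous_price(new_price, cut)
    print(f"{product}: ${new_price:.2f} now, about ${old_price:.2f} before")
```

The same helper can be reused to check whether any quoted discount and price pair is internally consistent before building a cost model on it.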

RELEASE · 1mo ago
Realtime TTS-2 releases with sub-200 ms TTFA and 100+ languages

Realtime TTS-2 ships as a low-latency speech model that conditions on prior audio turns, not just text, and claims sub-200 ms time-to-first-audio across 100+ languages. The release matters for voice-agent stacks because Replicate and LiveKit are already exposing it for real-time integration work.

NEWS · 2mo ago
ElevenLabs releases Agent Templates with 50+ support, SDR, and training workflows

ElevenLabs launched Agent Templates, a library of pre-configured conversational agents for support, education, sales, and internal enablement. That shortens the setup path for teams that want to deploy voice or chat agents without starting from a blank flow.

RELEASE · 2mo ago
OpenClaw 2026.4.24 adds voice-call handoff and browser recovery

OpenClaw shipped a release that routes realtime voice queries to the full agent, defaults new users to V4 Flash, and adds coordinate clicks plus stale-lock recovery for browser automation. It also fixes Telegram, Slack, MCP session, and TTS issues, so update if those flows matter to your setup.

RELEASE · 2mo ago
Grok launches STT and TTS APIs with WebSocket streaming and 25-plus languages

Grok added standalone speech-to-text and text-to-speech APIs with WebSocket streaming, word timestamps, diarization, and support for 25-plus languages. Developers building realtime audio apps can now call Grok Voice infrastructure directly instead of wiring it through the app UI.

RELEASE · 2mo ago
Hermes Agent launches Tool Gateway with 300+ models and bundled tools

Hermes Agent added Tool Gateway, bundling 300+ models with web, browser, image, terminal, and TTS tools behind one subscription. Firecrawl, Browser Use, Fal image models, and Gemini Voice shipped at launch.

RELEASE · 2mo ago
Gemini 3.1 Flash TTS launches with Audio Tags, 70+ languages and API preview

Google released Gemini 3.1 Flash TTS with inline Audio Tags, multi-speaker control and 70+ languages, and opened preview access through the Gemini API and AI Studio with rollout to Vertex AI and Google Vids. Independent evals ranked it near the top of current speech leaderboards, but it runs slower and costs more than the leading system.

RELEASE · 3mo ago
Qwen releases Qwen3.5-Omni with 10-hour audio and 400s video support

Alibaba launched Qwen3.5-Omni across Lite, Flash, Plus, and Plus-Realtime variants for native text, image, audio, and video understanding, plus realtime voice controls and script-level captioning. The family targets long multimodal sessions and live interaction, and it is focused on understanding, so check its limits if you need media generation.

RELEASE · 3mo ago
Mistral releases Voxtral TTS with 3-second cloning and 68.4% win rate vs ElevenLabs Flash v2.5

Voxtral TTS uses separate semantic and acoustic token models, a 2.14 kbps codec, and 3 to 25 seconds of reference audio for cloning across nine languages. Try it if you want a hybrid speech pipeline with more control and faster acoustic synthesis than all-autoregressive generation.

RELEASE · 3mo ago
Mistral launches Voxtral TTS with 9 languages and 90 ms first audio

Mistral released open-weight Voxtral TTS with low-latency streaming, voice cloning, and cross-lingual adaptation, and vLLM Omni shipped day-0 support. Voice-agent teams should compare quality, latency, and serving cost against closed APIs.

RELEASE · 3mo ago
Gemini 3.1 Flash Live launches with 90.8% audio tool-use score and 128K context

Google launched Gemini 3.1 Flash Live in AI Studio, the API, and Gemini Live with stronger audio tool use, lower latency, and 128K context. Voice-agent teams should benchmark quality, latency, and thinking settings before switching.

RELEASE · 3mo ago
Cohere launches Transcribe 03-2026 with 14 languages and Apache 2.0 weights

Cohere released a 2B speech-to-text model with 14 languages and top Open ASR scores, and upstreamed encoder-decoder optimizations to vLLM in the same launch. It is a self-hosted ASR option, so test accuracy and throughput on your own speech workload.

RELEASE · 3mo ago
KittenTTS releases 15M-to-80M ONNX voice models for CPU deployment

KittenTTS released nano, micro, and mini ONNX TTS models sized for CPU-first deployment instead of GPU-heavy stacks. Voice-agent builders should benchmark both dependency weight and real-time latency before treating tiny size as enough.

RELEASE · 3mo ago
Perplexity releases Comet on iOS with voice mode and agentic browsing

Perplexity released Comet for iPhone, bringing its AI-native browser, voice mode, and task-running assistant to mobile. Engineers tracking AI browser UX can now test how agentic browsing behaves as a default mobile browser rather than a desktop-only tool.

RELEASE · 3mo ago
xAI launches Grok Text-to-Speech API with 5 voices and emotion tags

xAI opened a Grok TTS API with five voices, inline controls for laughter and whispering, and multilingual streaming integrations that quickly landed in LiveKit and fal. Try it for voice products that need real-time playback, telephony formats, and hosted integration paths out of the box.

NEWS · 3mo ago
Artificial Analysis ranks Nemotron 3 VoiceChat at 77.8% conversational dynamics

Artificial Analysis published results for NVIDIA's Nemotron 3 VoiceChat, putting the 12B model on the open-weight Pareto frontier for conversational dynamics and speech reasoning. Consider it for open voice agents, but compare against proprietary systems that still lead the category by a wide margin.

NEWS · 3mo ago
Together AI launches unified voice stack with co-located STT, LLM, and TTS

Together AI launched a single-cloud stack for realtime voice agents that hosts Deepgram, Cartesia, MiniMax, and other voice components on one platform. Use it to cut latency and deployment overhead if you want one billing surface for production voice apps.


© 2026 AI Primer. All rights reserved.