Skip to content
AI Primer
TOPIC20 stories

Voice Agents

Realtime spoken conversational agents and voice workflows.

RELEASE11th May
Thinking Machines introduces interaction models with 200 ms full-duplex audio, video, and tool use

Thinking Machines previewed interaction models that process audio, video, and text in 200 ms micro-turns, letting the system listen, speak, and react at the same time. The demos matter because the interaction loop is trained into the model instead of stitched together from separate speech and tool layers.

NEWS9th May
Pi community ships `pi-listens`, `pi-kanban`, and `pi-codex-conversion` in one-day extension burst

Independent Pi builders shipped a voice layer, a kanban and observability dashboard, a Codex-conversion tool with `apply_patch`, and smaller UI extensions in the same window. The burst matters because it turns Pi from a single coding agent into a real local-first extension ecosystem with voice, review, and workflow primitives.

RELEASE7th May
OpenAI adds GPT-Realtime-2, Translate, and Whisper to the Realtime API

OpenAI added GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper to the Realtime API. The update gives voice agents live reasoning, translation, and transcription, but it remains API-only rather than part of ChatGPT voice mode.

NEWS7th May
ElevenLabs cuts Flash TTS 55%, Scribe 45%, and Agents 20% with pay-as-you-go billing

ElevenLabs lowered self-serve pricing for ElevenAPI and ElevenAgents and added pay-as-you-go billing. The biggest listed drops are to $0.05 per 1,000 tokens for Flash TTS, $0.22 for Scribe v2 speech-to-text, and $0.08 per minute for agent calls.

RELEASE1w ago
Realtime TTS-2 releases with sub-200 ms TTFA and 100+ languages

Realtime TTS-2 ships as a low-latency speech model that conditions on prior audio turns, not just text, and claims sub-200 ms time-to-first-audio across 100+ languages. The release matters for voice-agent stacks because Replicate and LiveKit are already exposing it for real-time integration work.

NEWS2w ago
ElevenLabs releases Agent Templates with 50+ support, SDR, and training workflows

ElevenLabs launched Agent Templates, a library of pre-configured conversational agents for support, education, sales, and internal enablement. That shortens the setup path for teams that want to deploy voice or chat agents without starting from a blank flow.

RELEASE2w ago
OpenClaw 2026.4.24 adds voice-call handoff and browser recovery

OpenClaw shipped a release that routes realtime voice queries to the full agent, defaults new users to V4 Flash, and adds coordinate clicks plus stale-lock recovery for browser automation. It also fixes Telegram, Slack, MCP session, and TTS issues, so update if those flows matter to your setup.

RELEASE3w ago
Grok launches STT and TTS APIs with WebSocket streaming and 25-plus languages

Grok added standalone speech-to-text and text-to-speech APIs with WebSocket streaming, word timestamps, diarization, and support for 25-plus languages. Developers building realtime audio apps can now call Grok Voice infrastructure directly instead of wiring it through the app UI.

RELEASE4w ago
Hermes Agent launches Tool Gateway with 300+ models and bundled tools

Hermes Agent added Tool Gateway, bundling 300+ models with web, browser, image, terminal, and TTS tools behind one subscription. Firecrawl, Browser Use, Fal image models, and Gemini Voice shipped at launch.

RELEASE4w ago
Gemini 3.1 Flash TTS launches with Audio Tags, 70+ languages and API preview

Google released Gemini 3.1 Flash TTS with inline Audio Tags, multi-speaker control and 70+ languages, and opened preview access through the Gemini API and AI Studio with rollout to Vertex AI and Google Vids. Independent evals ranked it near the top of current speech leaderboards, but it runs slower and costs more than the leading system.

RELEASE1mo ago
Qwen releases Qwen3.5-Omni with 10-hour audio and 400s video support

Alibaba launched Qwen3.5-Omni across Lite, Flash, Plus, and Plus-Realtime variants for native text, image, audio, and video understanding, plus realtime voice controls and script-level captioning. The family targets long multimodal sessions and live interaction, so watch the understanding-focused limits if you need media generation.

RELEASE1mo ago
Mistral releases Voxtral TTS with 3-second cloning and 68.4% win rate vs ElevenLabs Flash v2.5

Voxtral TTS uses separate semantic and acoustic token models, a 2.14 kbps codec, and 3-25 second reference audio for cloning across nine languages. Try it if you want a hybrid speech pipeline with more control and faster acoustic synthesis than all-autoregressive generation.

RELEASE1mo ago
Mistral launches Voxtral TTS with 9 languages and 90 ms first audio

Mistral released open-weight Voxtral TTS with low-latency streaming, voice cloning, and cross-lingual adaptation, and vLLM Omni shipped day-0 support. Voice-agent teams should compare quality, latency, and serving cost against closed APIs.

RELEASE1mo ago
Gemini 3.1 Flash Live launches with 90.8% audio tool-use score and 128K context

Google launched Gemini 3.1 Flash Live in AI Studio, the API, and Gemini Live with stronger audio tool use, lower latency, and 128K context. Voice-agent teams should benchmark quality, latency, and thinking settings before switching.

RELEASE1mo ago
Cohere launches Transcribe 03-2026 with 14 languages and Apache 2.0 weights

Cohere released a 2B speech-to-text model with 14 languages and top Open ASR scores, and upstreamed encoder-decoder optimizations to vLLM in the same launch. It is a self-hosted ASR option, so test accuracy and throughput on your own speech workload.

RELEASE1mo ago
KittenTTS releases 15M-to-80M ONNX voice models for CPU deployment

KittenTTS released nano, micro, and mini ONNX TTS models sized for CPU-first deployment instead of GPU-heavy stacks. Voice-agent builders should benchmark both dependency weight and real-time latency before treating tiny size as enough.

RELEASE1mo ago
Perplexity releases Comet on iOS with voice mode and agentic browsing

Perplexity released Comet for iPhone, bringing its AI-native browser, voice mode, and task-running assistant to mobile. Engineers tracking AI browser UX can now test how agentic browsing behaves as a default mobile browser rather than a desktop-only tool.

RELEASE1mo ago
xAI launches Grok Text-to-Speech API with 5 voices and emotion tags

xAI opened a Grok TTS API with five voices, inline controls for laughter and whispering, and multilingual streaming integrations that quickly landed in LiveKit and fal. Try it for voice products that need real-time playback, telephony formats, and hosted integration paths out of the box.

NEWS1mo ago
Artificial Analysis ranks Nemotron 3 VoiceChat at 77.8% conversational dynamics

Artificial Analysis published results for NVIDIA's Nemotron 3 VoiceChat, putting the 12B model at the open-weight pareto frontier across conversational dynamics and speech reasoning. Consider it for open voice agents, but compare against proprietary systems that still lead the category by a wide margin.

NEWS2mo ago
Together AI launches unified voice stack with co-located STT, LLM, and TTS

Together AI launched a single-cloud stack for realtime voice agents that hosts Deepgram, Cartesia, MiniMax, and other voice components on one platform. Use it to cut latency and deployment overhead if you want one billing surface for production voice apps.

AI PrimerAI Primer

Your daily guide to AI tools, workflows, and creative inspiration.

© 2026 AI Primer. All rights reserved.