AI Primer
TOPIC 21 stories

Realtime AI

Low-latency streaming systems for voice, video, and live interaction.

RELEASE 23rd June
AssemblyAI launches Universal-3.5 Pro Realtime with Context Carryover

AssemblyAI’s Universal-3.5 Pro Realtime now carries forward the agent side of a conversation to improve live transcription. The release also ships multilingual realtime ASR features, and one early deployment reported that critical-utterance errors fell from 26% to 9%.

RELEASE 22nd June
Vercel supports WebSockets in Fluid with Socket.IO and 30-minute reconnects

Vercel rolled out native WebSocket support so Node.js libraries like Socket.IO can run from CDN to Fluid. Existing sessions still reconnect at the 30-minute function limit, so teams should test long-lived connections before migrating.
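As a rough planning aid for that test, the sketch below estimates how many forced reconnects a long session would hit at the 30-minute limit and produces a capped exponential backoff schedule for the client. The function names, the 0.5-second base delay, and the 30-second cap are illustrative assumptions, not part of Vercel's or Socket.IO's API.

```python
import math

# Limit quoted in the story above; everything else here is an assumption.
FUNCTION_LIMIT_MINUTES = 30


def forced_reconnects(session_minutes: float,
                      limit_minutes: float = FUNCTION_LIMIT_MINUTES) -> int:
    """Count the reconnects a session of this length would hit at the limit."""
    if session_minutes <= 0:
        return 0
    # A session that ends exactly on a boundary does not need another reconnect.
    return math.ceil(session_minutes / limit_minutes) - 1


def backoff_delays(attempts: int,
                   base_seconds: float = 0.5,
                   cap_seconds: float = 30.0) -> list[float]:
    """Capped exponential backoff delays for successive reconnect attempts."""
    return [min(cap_seconds, base_seconds * 2 ** n) for n in range(attempts)]
```

Under these assumptions a 95-minute session would be interrupted three times, so any state that must survive the interruption should be resumable from the client side.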

RELEASE 1w ago
LiveKit ships Turn Detector v1 with 14-language endpointing

LiveKit released Turn Detector v1 on Cloud and a smaller v1-mini bundled with its Agents SDKs for fast CPU inference. The model predicts end-of-turn directly from speech across 14 languages, changing interruption behavior and latency in voice agents.

RELEASE 1w ago
Cartesia releases Sonic-3.5 and Ink-2 for streaming TTS and STT

Cartesia launched Sonic-3.5 for text-to-speech and Ink-2 for speech-to-text, calling them its new top streaming voice models. The release pairs low-latency voice-agent claims with 42-language support and immediate partner availability.

RELEASE 2w ago
Zyphra releases ZONOS2: 8B sparse MoE TTS with zero-shot voice cloning

Zyphra released ZONOS2 under Apache 2.0 with 8B total parameters, 900M active, zero-shot voice cloning, 44.1 kHz DAC audio, and ZTTS1-Eval. The release includes open weights, inference code, and eval code, so teams can run real-time multilingual TTS without a hosted-only stack.

RELEASE 2w ago
Google launches Gemini 3.5 Live Translate for 70+ languages

Google released Gemini 3.5 Live Translate for low-latency speech translation across 70+ languages in the Gemini Live API, AI Studio, and Google Translate. The same model is also heading to Google Meet in private preview for Workspace customers.

NEWS 4w ago
Artificial Analysis launches AA-WER Streaming with Cartesia Ink-2 at 3.7% WER

Artificial Analysis launched AA-WER Streaming to benchmark streaming speech-to-text models on accuracy and latency for voice agents. The first leaderboard puts Cartesia Ink-2 and ElevenLabs Scribe v2 on the price-latency frontier, so teams should compare cost against latency before choosing a model.
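For readers unfamiliar with the metric, here is a minimal sketch of word error rate as it is conventionally defined: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The story does not describe how AA-WER Streaming normalizes text or aligns streaming output, so the lowercase-and-split normalization and the function name below are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn the lights off", "turn the light off"))  # 0.25
```

One wrong word out of four gives 25% WER, which shows why short, critical utterances can score badly even when the transcript looks nearly right.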

RELEASE 1mo ago
ElevenLabs launches Speech Engine at 8¢ per minute for chat-to-voice agents

ElevenLabs launched Speech Engine, a layer that adds transcription, speech synthesis, turn-taking, and interruption handling on top of an existing chat agent. The release pairs SDKs, one-command setup, and 8¢-per-minute pricing for production voice agents.

NEWS 1mo ago
Gemini users report Canvas and Fast mode routing to 3.2 variants ahead of I/O

Multiple users posted reproducible steps and videos showing Gemini app UI changes, Thinking Level rollout, and Fast mode or Canvas sessions that look like 3.2 or 3.5-class routing. This matters because Google appears to be testing new model paths and app surfaces in production ahead of I/O, though the exact model names remain unconfirmed.

RELEASE 1mo ago
Thinking Machines introduces interaction models with 200 ms full-duplex audio, video, and tool use

Thinking Machines previewed interaction models that process audio, video, and text in 200 ms micro-turns, letting the system listen, speak, and react at the same time. The demos matter because the interaction loop is trained into the model instead of stitched together from separate speech and tool layers.

RELEASE 1mo ago
OpenAI adds GPT-Realtime-2, Translate, and Whisper to the Realtime API

OpenAI added GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper to the Realtime API. The update gives voice agents live reasoning, translation, and transcription, but it remains API-only rather than part of ChatGPT voice mode.

RELEASE 1mo ago
Realtime TTS-2 releases with sub-200 ms TTFA and 100+ languages

Realtime TTS-2 ships as a low-latency speech model that conditions on prior audio turns, not just text, and claims sub-200 ms time-to-first-audio across 100+ languages. The release matters for voice-agent stacks because Replicate and LiveKit are already exposing it for real-time integration work.

RELEASE 2mo ago
Grok launches STT and TTS APIs with WebSocket streaming and 25-plus languages

Grok added standalone speech-to-text and text-to-speech APIs with WebSocket streaming, word timestamps, diarization, and support for 25-plus languages. Developers building realtime audio apps can now call Grok Voice infrastructure directly instead of wiring it through the app UI.

RELEASE 3mo ago
Qwen releases Qwen3.5-Omni with 10-hour audio and 400s video support

Alibaba launched Qwen3.5-Omni across Lite, Flash, Plus, and Plus-Realtime variants for native text, image, audio, and video understanding, plus realtime voice controls and script-level captioning. The family targets long multimodal sessions and live interaction; its focus is understanding, so check its limits if you need media generation.

RELEASE 3mo ago
Gemini 3.1 Flash Live launches with 90.8% audio tool-use score and 128K context

Google launched Gemini 3.1 Flash Live in AI Studio, the API, and Gemini Live with stronger audio tool use, lower latency, and 128K context. Voice-agent teams should benchmark quality, latency, and thinking settings before switching.

RELEASE 3mo ago
Mistral launches Voxtral TTS with 9 languages and 90 ms first audio

Mistral released open-weight Voxtral TTS with low-latency streaming, voice cloning, and cross-lingual adaptation, and vLLM Omni shipped day-0 support. Voice-agent teams should compare quality, latency, and serving cost against closed APIs.

RELEASE 3mo ago
KittenTTS releases 15M-to-80M ONNX voice models for CPU deployment

KittenTTS released nano, micro, and mini ONNX TTS models sized for CPU-first deployment instead of GPU-heavy stacks. Voice-agent builders should benchmark both dependency weight and real-time latency before assuming that a small model is enough.

RELEASE 3mo ago
Hao AI Lab launches Dreamverse: 30s 1080p video in 4.5s on one GPU

Dreamverse paired Hao AI Lab's FastVideo stack with an interface for editing video scenes in a faster-than-playback loop, using quantization and fused kernels to keep latency below viewing time. The stack is interesting if you are building real-time multimodal generation or multi-user video serving.
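The "faster than playback" claim can be checked with simple arithmetic on the headline figures (a 30-second clip generated in 4.5 seconds). The helper below is a hypothetical illustration, not part of FastVideo or Dreamverse.

```python
def real_time_factor(clip_seconds: float, generation_seconds: float) -> float:
    """Seconds of video produced per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return clip_seconds / generation_seconds


# Headline figures from the story: 30 s of video in 4.5 s of compute.
factor = real_time_factor(30.0, 4.5)
print(f"{factor:.2f}x playback speed")  # 6.67x playback speed
```

Any factor above 1.0 means generation finishes before the clip would finish playing, which is the condition the editing loop depends on.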

NEWS 3mo ago
Artificial Analysis ranks Nemotron 3 VoiceChat at 77.8% conversational dynamics

Artificial Analysis published results for NVIDIA's Nemotron 3 VoiceChat, putting the 12B model on the open-weight Pareto frontier across conversational dynamics and speech reasoning. Consider it for open voice agents, but compare against proprietary systems that still lead the category by a wide margin.

RELEASE 3mo ago
xAI launches Grok Text-to-Speech API with 5 voices and emotion tags

xAI opened a Grok TTS API with five voices, inline controls for laughter and whispering, and multilingual streaming integrations that quickly landed in LiveKit and fal. Try it for voice products that need real-time playback, telephony formats, and hosted integration paths out of the box.

NEWS 3mo ago
Together AI launches unified voice stack with co-located STT, LLM, and TTS

Together AI launched a single-cloud stack for realtime voice agents that hosts Deepgram, Cartesia, MiniMax, and other voice components on one platform. Use it to cut latency and deployment overhead if you want one billing surface for production voice apps.

© 2026 AI Primer. All rights reserved.