Stories, products, and related signals connected to this tag in Explore.
Microsoft open-sourced VibeVoice with 60-minute ASR, speaker-timed transcripts, and 300ms streaming TTS across 50+ languages. HN discussion around KittenTTS shows the same push toward lighter voice stacks, where latency and dependency bloat still matter on edge hardware.
Browser demo posts and a Hugging Face release surfaced Cohere Transcribe 2B as part of a wider open-audio week that also featured Voxtral 4B TTS. The model gives creators a multilingual ASR option that can run closer to local or browser workflows.
Hacker News discussion around KittenTTS has shifted to edge deployment, streaming latency, expressive control, and prosody rather than new model changes. The 25MB ONNX footprint keeps it attractive for CPU and on-device use, but voice quality remains the barrier to production use.
KittenTTS now offers nano, micro, and mini text-to-speech models, with the smallest int8 build coming in under 25MB and targeting ONNX CPU inference. Creators can run local voice tools without a cloud round trip.
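A minimal sketch of that no-cloud path, based on KittenTTS's published quickstart; the model ID and voice name below track the original release and may differ in newer builds, so treat them as assumptions and check the repo:

```python
# Minimal KittenTTS sketch: text in, WAV out, CPU only.
# Assumes kittentts and soundfile are installed; the model ID and
# voice name follow the project's quickstart and may differ by release.
import soundfile as sf
from kittentts import KittenTTS

model = KittenTTS("KittenML/kitten-tts-nano-0.1")  # ~25MB int8 nano build

audio = model.generate(
    "This voice was synthesized entirely on CPU, with no cloud round trip.",
    voice="expr-voice-2-f",  # one of the bundled preset voices
)

sf.write("output.wav", audio, 24000)  # quickstart output rate is 24kHz
```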
Google says its new realtime voice model improves noisy-environment understanding, long conversations and function calling, and it's rolling into Gemini Live, Search Live and AI Studio. Voice creators can test it for lower-latency spoken interactions.
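For creators who want to probe those latency claims, here is a rough sketch of one audio turn through the google-genai SDK's Live API. The model ID is a placeholder assumption, since the announcement doesn't name one, and the Live surface changes often, so verify method names against the current docs:

```python
# Rough sketch: request one spoken reply over a Gemini Live session and
# collect the streamed audio. MODEL is a placeholder assumption; use
# whatever Live-capable audio model AI Studio currently lists.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
MODEL = "gemini-2.0-flash-live-001"  # placeholder Live-capable model ID

async def main():
    config = {"response_modalities": ["AUDIO"]}
    async with client.aio.live.connect(model=MODEL, config=config) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Read me today's plan."}]}
        )
        pcm = bytearray()
        async for msg in session.receive():
            if msg.data:  # inline chunks of 16-bit PCM audio
                pcm.extend(msg.data)
        print(f"received {len(pcm)} bytes of audio")

asyncio.run(main())
```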
Smallest says Lightning V3.1 can clone a voice from about 10 seconds of audio, with 44.1kHz output, sub-100ms latency, and 50+ languages on Waves. Test it for multilingual narration and dubbing, but get explicit permission before cloning any voice.
KittenTTS 0.8 ships new 15M, 40M, and 80M models, including an int8 nano model around 25MB that runs on CPU without a GPU. It's a fit for narration, character voices, and lightweight assistants that need offline or edge-friendly speech.
KittenML's latest open-source TTS release spans 15M to 80M models, with the smallest coming in under 25MB and the largest reportedly running faster than realtime on CPU. Audio creators should test pronunciation and install overhead before betting on it for edge or local voice tools.
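The faster-than-realtime claim is cheap to check yourself with a realtime-factor (RTF) measurement: synthesis time divided by audio duration, where below 1.0 means faster than realtime. This sketch reuses the assumed quickstart names from above:

```python
# Rough RTF check on your own CPU before committing to a model.
# Model ID, voice name, and the 24kHz output rate are assumptions
# carried over from the quickstart; adjust for the build you install.
import time

from kittentts import KittenTTS

SAMPLE_RATE = 24000
model = KittenTTS("KittenML/kitten-tts-nano-0.1")
text = "The quick brown fox jumps over the lazy dog. " * 4

start = time.perf_counter()
audio = model.generate(text, voice="expr-voice-2-f")
elapsed = time.perf_counter() - start

duration = len(audio) / SAMPLE_RATE
print(f"RTF {elapsed / duration:.2f}: {elapsed:.2f}s for {duration:.2f}s of audio")
```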
Variety reports that As Deep as the Grave used generative AI to create Val Kilmer's performance, with material supplied by his family, who backed the release. For filmmakers, it is an early consent-based case study in digital resurrection, where rights and audience expectations matter.
Tongyi Lab open-sourced Fun-CineForge with multi-speaker dubbing, a temporal modality for off-screen or occluded faces, and a full dataset-building pipeline. It matters for dialogue and localization workflows that break on hard cuts, overlapping speech, or missing lip cues.
xAI released Grok's Text-to-Speech API with natural voices, expressive controls, and LiveKit support; creators are also using Grok Imagine in reference-image and cartoon animation workflows. Try it if you want Grok in a broader voice-and-motion stack instead of chat alone.
Creators report using Seedance 2.0 for wildlife-documentary scenes with built-in narration prompts and for character clips that include sound effects. Test it if you want a faster path from prompt to finished short without a separate voice pass.
Freepik launched Speak, which turns an image plus text or audio into a lip-synced talking video with 30+ languages and a 5-minute cap. Use it for UGC ads, localized product demos, and fast talking-head tests without reshoots.
Runway opened Characters on its developer platform with API access, custom voices, embedded knowledge, and a free starter allowance. Use it to build interactive hosts, guides, and assistants that can talk through tasks instead of relying on passive video.