Release · March 29, 2026

Microsoft opens VibeVoice for 60-minute ASR and 300ms streaming TTS

Microsoft opened VibeVoice with 60-minute ASR, speaker-timed transcripts and 300ms streaming TTS across 50+ languages. HN discussion around Kitten TTS shows the same push toward lighter voice stacks, while latency and dependency bloat still matter on edge hardware.

Voice Local Inference

3 min read

TL;DR

Microsoft opened VibeVoice as an open-source voice stack with long-form ASR and streaming TTS; the launch thread says it can transcribe 60-minute recordings in one pass and generate first audio in about 300ms.
The linked GitHub repo describes structured transcripts with speaker labels and word-level timestamps, which makes the release more useful for podcast edits, interview logging, and dialogue-heavy production than plain text dumps.
VibeVoice also targets multilingual workflows: the launch thread says it supports 50-plus languages, and the [img:0|VibeVoice graphic] shows Microsoft positioning ASR and TTS as one model family.
A parallel HN thread around Kitten TTS shows where creator interest is heading: smaller, CPU-friendly voice models are attractive for narration and bots, but the HN discussion says latency, streaming architecture, and dependency bloat still decide whether a voice stack is usable on edge hardware.

What shipped

Hasan Toor

@hasantoxr

🚨 BREAKING: Microsoft just open-sourced a frontier Voice AI that handles 60-minute audio in a single pass. You drop in a recording. It identifies every speaker, timestamps every word, and outputs a full structured transcript with who said what and when. It also does real-time […]

6:37 PM · Mar 29, 2026

Microsoft's VibeVoice release is two products in one family: long-context speech recognition and low-latency text-to-speech. In the launch thread, Hasan Toor summarizes the headline numbers as 60-minute single-pass ASR, speaker identification, word timestamps, and roughly 300ms first-audio latency for streaming TTS. The linked GitHub repo adds that the ASR side can output structured transcriptions with timestamps and speaker attribution, and that the stack supports fine-tuning, vLLM-based inference, and 50-plus languages.

For creators, the practical shift is less cleanup between recording and edit. A single recording can be turned into a speaker-separated transcript for rough cuts, while the TTS side is aimed at fast voice generation instead of offline batch export. The [img:0|project graphic] also shows Microsoft framing VibeVoice as a broader open-source voice platform rather than a single demo model.
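To make the rough-cut idea concrete, here is a minimal Python sketch of how word-level timestamps with speaker labels could be folded into speaker turns. The `Word` schema, the field names, and the sample data are assumptions made for illustration; they are not VibeVoice's documented output format.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Word:
    # Hypothetical schema for illustration; not VibeVoice's actual output.
    text: str
    start: float  # seconds from the start of the recording
    end: float
    speaker: str


def speaker_turns(words):
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w.speaker):
        group = list(group)
        turns.append({
            "speaker": speaker,
            "start": group[0].start,
            "end": group[-1].end,
            "text": " ".join(w.text for w in group),
        })
    return turns


def format_timestamp(seconds):
    """Render seconds as MM:SS for an edit log."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Made-up sample data, already sorted by start time.
words = [
    Word("Welcome", 0.0, 0.4, "S1"),
    Word("back", 0.4, 0.7, "S1"),
    Word("Thanks", 1.1, 1.5, "S2"),
    Word("for", 1.5, 1.6, "S2"),
    Word("having", 1.6, 1.9, "S2"),
    Word("me", 1.9, 2.1, "S2"),
    Word("Let's", 62.0, 62.3, "S1"),
    Word("begin", 62.3, 62.8, "S1"),
]

for turn in speaker_turns(words):
    print(f"[{format_timestamp(turn['start'])}] {turn['speaker']}: {turn['text']}")
```

Each turn keeps its start and end time, so an editor could jump straight to a speaker change instead of scrubbing through the recording.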

Why this matters for creative workflows

Hacker News · 560 points · 182 comments

Show HN: Three new Kitten TTS models – smallest less than 25MB

Posted by rohan_joshi

This is a compact TTS release focused on expressive voice output at very small sizes, which matters for narration, voice apps, and interactive audio workflows. The discussion centers on whether prosody, pronunciation, and expressive control are good enough to make tiny models usable in real voice production scenarios.

Discussed by

tredre3 on packaging and dependencies
baibai008989 on edge deployment and latency
bobokaytop on edge performance tradeoffs

VibeVoice lands in a voice market that is splitting in two directions: bigger open models that cover more of the workflow, and tiny models that can run closer to the device. The Kitten TTS page describes Kitten TTS as an ONNX library with 15M to 80M parameter models, CPU inference, eight built-in voices, adjustable speed, and 24 kHz output, with the smallest model under 25MB.

That contrast sharpens the real question for creatives: not just output quality, but where the model can run and how fast it starts talking. In the HN thread, one user reports using Kitten in a Discord bot workflow at about 1.5x realtime on an Intel 9700 CPU, while the HN discussion highlights the recurring blockers on lightweight stacks: installation pulling heavy dependencies, unclear first-chunk latency, and streaming limitations on low-power hardware.
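The two numbers in play here, first-chunk latency and realtime factor, can be measured the same way for any streaming engine. The sketch below assumes "1.5x realtime" means 1.5 seconds of audio produced per second of wall-clock time, and it uses a stand-in generator, `fake_tts_stream`, in place of a real TTS library; neither function is part of Kitten TTS or VibeVoice.

```python
import time


def fake_tts_stream(n_chunks=5, chunk_seconds=0.2, compute_seconds=0.01):
    """Hypothetical stand-in for a streaming TTS engine.

    Yields the duration of each audio chunk; a real engine would yield samples.
    """
    for _ in range(n_chunks):
        time.sleep(compute_seconds)  # simulated synthesis work
        yield chunk_seconds


def measure_stream(chunks):
    """Return (first_chunk_latency_s, audio_seconds, realtime_factor)."""
    start = time.perf_counter()
    first_latency = None
    audio_seconds = 0.0
    for chunk_seconds in chunks:
        if first_latency is None:
            # Time until the listener could hear anything at all.
            first_latency = time.perf_counter() - start
        audio_seconds += chunk_seconds
    wall_seconds = time.perf_counter() - start
    # Above 1.0 the engine produces audio faster than it plays back.
    return first_latency, audio_seconds, audio_seconds / wall_seconds


first, audio, rtf = measure_stream(fake_tts_stream())
print(f"first chunk after {first * 1000:.0f} ms, "
      f"{audio:.1f} s of audio, {rtf:.1f}x realtime")
```

A realtime factor above 1.0 only says that playback will not stall once it has started; the first-chunk figure is what decides how responsive a bot or assistant feels, which is why the thread's questions focus on it.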

🧾 More sources

TL;DR · 1 tweet

Summary of the launch facts and the supporting market context around lightweight voice models.

Hacker News discussion · 560 points · 182 comments

Discussion around Show HN: Three new Kitten TTS models – smallest less than 25MB

Posted by rohan_joshi

Thread discussion highlights:

tredre3 on packaging and dependencies: Feedback that installing the Python package pulls in a heavy chain of dependencies, including spaCy, and may drag in torch and NVIDIA CUDA packages that are unnecessary for running the model.
baibai008989 on edge deployment and latency: A Raspberry Pi user says 25MB is exciting for home automation, but asks about first-chunk latency, streaming support, and pronunciation consistency for technical terms.
bobokaytop on edge performance tradeoffs: Notes that model size is impressive, but the practical bottleneck is inference latency and audio streaming architecture on low-power hardware, not just file size.

What shipped · 1 tweet

Core release details for VibeVoice, including the ASR and TTS capabilities relevant to creative production workflows.

Hacker News page · 560 points · 182 comments

KittenML/KittenTTS: State-of-the-art TTS model under 25MB

Posted by rohan_joshi

Kitten TTS is an open-source, lightweight text-to-speech library built on ONNX with models from 15M to 80M parameters (25-80 MB). It runs CPU inference without a GPU and offers 8 built-in voices, adjustable speed, text preprocessing, and 24 kHz output. The latest release, v0.8.1 (Feb 2026), includes nano, micro, and mini models. It installs via pip and exposes a simple Python API for generation. The repo has 13k+ stars and an Apache 2.0 license.
