KittenTTS ships ONNX voice models as small as 25MB while HN debates prosody
Hacker News discussion around KittenTTS has shifted toward edge deployment, streaming latency, expressive control, and prosody rather than new model releases. The 25MB ONNX footprint keeps it attractive for CPU and on-device use, but voice quality is still the production boundary.

TL;DR
- KittenTTS is the new part here: the GitHub page describes three ONNX text-to-speech models from 15M to 80M parameters, with the smallest int8 model landing at 25MB and targeting CPU use without a GPU.
- For creative voice workflows, the HN thread has moved past launch hype to a narrower question: whether these compact models sound expressive enough for narration, character voices, and other production work.
- The strongest upside in the discussion roundup is deployability: commenters call out the small footprint as unusually practical for edge and on-device setups, especially compared with heavier torch-and-CUDA stacks (a CPU-only loading sketch follows this list).
- The same HN thread also surfaces the main limit: creators are still asking about prosody, expressive tags, and low-power streaming latency, which suggests quality control remains the real boundary rather than download size.
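To make the deployability point concrete, here is a minimal sketch of what CPU-only loading can look like with onnxruntime. The model file name below is a placeholder, and the actual input signature of the KittenTTS graph should be read from the session rather than assumed.

```python
import onnxruntime as ort

# CPU-only session: no torch or CUDA stack required, which is the point
# commenters make about edge deployment. The file name below is a placeholder,
# not the actual artifact name shipped by KittenTTS.
session = ort.InferenceSession(
    "kitten_tts_nano_int8.onnx",         # hypothetical path to the ~25MB int8 model
    providers=["CPUExecutionProvider"],
)

# Read the real input signature from the exported graph instead of guessing it.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```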
What is the real creative takeaway?
KittenML/KittenTTS: State-of-the-art TTS model under 25MB
KittenTTS looks useful because the packaging is unusually light, not because it solves voice performance. The repo says v0.8.1 ships nano, micro, and mini models with eight built-in voices, adjustable speed, text preprocessing, 24 kHz output, and a simple Python install path, per the project page.
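For reference, the Python path the repo advertises looks roughly like the sketch below. The package name, model identifier, voice string, and generate() signature are assumptions modeled on the project's published quick start and may differ in the current release.

```python
# Minimal sketch of the advertised Python usage, not a verified quick start.
# Install per the project page first; the exact pip target (PyPI name or
# wheel URL) may differ from a plain `pip install kittentts`.
import soundfile as sf
from kittentts import KittenTTS   # assumed package and class name

# Assumed model identifier for the smallest (~25MB) variant.
model = KittenTTS("KittenML/kitten-tts-nano-0.1")

audio = model.generate(
    "A quick narration test for the nano model.",
    voice="expr-voice-2-f",       # assumed name for one of the eight built-in voices
)

# The project advertises 24 kHz output, so write the waveform at that rate.
sf.write("narration_test.wav", audio, 24000)
```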
Discussion around Show HN: Three new Kitten TTS models – smallest less than 25MB
560 upvotes · 182 comments
What the community is stress-testing is the part creatives actually feel in finished work. In the Hacker News discussion, one commenter says 25MB is exciting because dependency bloat can kill edge deployment, while others push on the harder questions: whether inference stays responsive on low-power hardware, whether audio streaming is smooth, and whether expressive control and prosody are good enough for real narration instead of just demo clips.
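One way to put numbers on the low-power latency question is a real-time-factor check: wall-clock synthesis time divided by the duration of the generated audio. The helper below is a generic sketch; `synthesize` stands in for whatever TTS call is under test, and an RTF comfortably below 1.0 is the usual bar for keeping up with playback on-device.

```python
import time
import numpy as np

def real_time_factor(synthesize, text: str, sample_rate: int = 24_000) -> float:
    """Wall-clock synthesis time divided by generated audio duration (lower is better)."""
    start = time.perf_counter()
    audio = np.asarray(synthesize(text))      # `synthesize` is a placeholder callable
    elapsed = time.perf_counter() - start
    audio_seconds = len(audio) / sample_rate
    return elapsed / audio_seconds

# Example, assuming the `model` object from the quick-start sketch above:
# rtf = real_time_factor(lambda t: model.generate(t, voice="expr-voice-2-f"))
# print(f"RTF: {rtf:.2f}  (<1.0 means synthesis keeps pace with playback)")
```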