KittenTTS ships ONNX voice models as small as 25MB while HN debates prosody
Hacker News discussion around KittenTTS has shifted toward edge deployment, streaming latency, expressive control, and prosody rather than new model releases. The 25MB ONNX footprint keeps it attractive for CPU and on-device use, but voice quality is still the production boundary.

TL;DR
- KittenTTS is the new part here: the GitHub page describes three ONNX text-to-speech models from 15M to 80M parameters, with the smallest int8 model landing at 25MB and targeting CPU use without a GPU.
- For creative voice workflows, the HN thread has moved past launch hype to a narrower question: whether these compact models sound expressive enough for narration, character voices, and other production work.
- The strongest upside in the discussion roundup is deployability: commenters call out the small footprint as unusually practical for edge and on-device setups, especially compared with heavier torch-and-CUDA stacks (a CPU-only loading sketch follows this list).
- The same HN thread also surfaces the main limit: creators are still asking about prosody, expressive tags, and low-power streaming latency, which suggests quality control remains the real boundary rather than download size.
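To make the deployability point concrete, here is a minimal sketch of what CPU-only loading can look like with onnxruntime. The model file name below is a placeholder, and the actual input signature of the KittenTTS graph should be read from the session rather than assumed.

```python
import onnxruntime as ort

# CPU-only session: no torch or CUDA stack required, which is the point
# commenters make about edge deployment. The file name below is a placeholder,
# not the actual artifact name shipped by KittenTTS.
session = ort.InferenceSession(
    "kitten_tts_nano_int8.onnx",         # hypothetical path to the ~25MB int8 model
    providers=["CPUExecutionProvider"],
)

# Read the real input signature from the exported graph instead of guessing it.
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
```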
What is the real creative takeaway?
KittenML/KittenTTS: State-of-the-art TTS model under 25MB
KittenTTS looks useful because the packaging is unusually light, not because it solves voice performance. The repo says v0.8.1 ships nano, micro, and mini models with eight built-in voices, adjustable speed, text preprocessing, 24 kHz output, and a simple Python install path, per the project page.
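For reference, the Python path the repo advertises looks roughly like the sketch below. The package name, model identifier, voice string, and generate() signature are assumptions modeled on the project's published quick start and may differ in the current release.

```python
# Minimal sketch of the advertised Python usage, not a verified quick start.
# Install per the project page first; the exact pip target (PyPI name or
# wheel URL) may differ from a plain `pip install kittentts`.
import soundfile as sf
from kittentts import KittenTTS   # assumed package and class name

# Assumed model identifier for the smallest (~25MB) variant.
model = KittenTTS("KittenML/kitten-tts-nano-0.1")

audio = model.generate(
    "A quick narration test for the nano model.",
    voice="expr-voice-2-f",       # assumed name for one of the eight built-in voices
)

# The project advertises 24 kHz output, so write the waveform at that rate.
sf.write("narration_test.wav", audio, 24000)
```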
Discussion around Show HN: Three new Kitten TTS models – smallest less than 25MB
560 upvotes · 182 comments
What the community is stress-testing is the part creatives actually feel in finished work. In the Hacker News discussion, one commenter says 25MB is exciting because dependency bloat can kill edge deployment, while others push on the harder questions: whether inference stays responsive on low-power hardware, whether audio streaming is smooth, and whether expressive control and prosody are good enough for real narration instead of just demo clips.
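One way to put numbers on the low-power latency question is a real-time-factor check: wall-clock synthesis time divided by the duration of the generated audio. The helper below is a generic sketch; `synthesize` stands in for whatever TTS call is under test, and an RTF comfortably below 1.0 is the usual bar for keeping up with playback on-device.

```python
import time
import numpy as np

def real_time_factor(synthesize, text: str, sample_rate: int = 24_000) -> float:
    """Wall-clock synthesis time divided by generated audio duration (lower is better)."""
    start = time.perf_counter()
    audio = np.asarray(synthesize(text))      # `synthesize` is a placeholder callable
    elapsed = time.perf_counter() - start
    audio_seconds = len(audio) / sample_rate
    return elapsed / audio_seconds

# Example, assuming the `model` object from the quick-start sketch above:
# rtf = real_time_factor(lambda t: model.generate(t, voice="expr-voice-2-f"))
# print(f"RTF: {rtf:.2f}  (<1.0 means synthesis keeps pace with playback)")
```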