Hacker News discussion around KittenTTS has shifted to edge deployment, streaming latency, and expressive control and prosody rather than new model releases. The 25 MB ONNX footprint keeps it attractive for CPU and on-device use, but voice quality remains the production boundary.

Posted by rohan_joshi
Kitten TTS is an open-source, lightweight text-to-speech library built on ONNX, with models from 15M to 80M parameters (25-80 MB). It runs CPU inference without a GPU and offers 8 built-in voices, adjustable speed, text preprocessing, and 24 kHz output. The latest release, v0.8.1 (Feb 2026), includes nano (15M, int8, 25 MB), micro (40M), and mini (80M) models. It installs via pip and exposes a basic Python API for generation and file output. The repo has 13k stars under the Apache 2.0 license.
KittenTTS looks useful because the packaging is unusually light, not because it has solved voice quality. According to the project page, v0.8.1 ships nano, micro, and mini models and supports eight built-in voices, adjustable speed, text preprocessing, 24 kHz output, and a simple pip-based Python install.
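A minimal sketch of what the generate-then-write-a-file path amounts to. The KittenTTS class and method names in the comments are assumptions, not taken from the repo; the runnable part substitutes a stand-in tone for model output and shows the 24 kHz file-output step using only the standard library:

```python
import math
import struct
import wave

# Hypothetical KittenTTS call (names are assumptions, shown for orientation):
#   from kittentts import KittenTTS
#   tts = KittenTTS("kitten-tts-nano")        # 15M int8 model, ~25 MB
#   samples = tts.generate("Hello world")     # floats in [-1, 1] at 24 kHz

SAMPLE_RATE = 24_000  # KittenTTS outputs 24 kHz audio

# Stand-in for model output: 0.5 s of a 440 Hz tone as floats in [-1, 1]
samples = [0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
           for n in range(SAMPLE_RATE // 2)]

# File-output step: write 16-bit mono PCM, as a TTS pipeline typically would
with wave.open("out.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(struct.pack("<h", int(s * 32767))
                            for s in samples))
```

The same write path works unchanged whatever model produced the samples, which is part of why a dependency-light package is attractive: nothing here needs torch or CUDA.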
Posted by rohan_joshi
Thread discussion highlights:
- baibai008989 on edge deployment / dependency bloat: the dependency chain issue is a real barrier for edge deployment... anything that pulls torch + cuda makes the whole thing a non-starter. 25MB is genuinely exciting for that use case.
- bobokaytop on latency and real-time use: the practical bottleneck for most edge deployments isn't model size -- it's the inference latency on low-power hardware and the audio streaming architecture around it.
- altruios on expressive control: One of the core features I look for is expressive control... How does it handle expressive tags?
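bobokaytop's latency point can be made concrete with a small budget calculation: chunked streaming stays real-time only if the real-time factor (seconds of synthesis per second of audio) is below 1, and time-to-first-audio is the synthesis time of the first chunk. The numbers below are illustrative assumptions, not KittenTTS benchmarks:

```python
def streaming_budget(rtf: float, chunk_s: float) -> tuple[float, bool]:
    """Given a real-time factor (synthesis seconds per audio second) and a
    chunk length in seconds, return (time_to_first_audio_s, keeps_up)."""
    ttfa = rtf * chunk_s   # the first chunk must be fully synthesized
    return ttfa, rtf < 1.0  # later chunks arrive in time only if RTF < 1

# Illustrative numbers (assumptions): RTF 0.4 on a low-power CPU,
# 0.5 s audio chunks -> first audio after 0.2 s, and streaming keeps up.
ttfa, ok = streaming_budget(rtf=0.4, chunk_s=0.5)
```

This is why the commenters separate model size from latency: a 25 MB model with RTF above 1 on the target CPU still cannot stream, while a larger model with RTF well under 1 can.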
What the community is stress-testing is the part creatives actually feel in finished work. In the Hacker News discussion, one commenter calls 25MB exciting because dependency bloat can kill edge deployment, while others push on the harder questions: whether inference stays responsive on low-power hardware, whether audio streaming is smooth, and whether expressive control and prosody hold up for real narration rather than demo clips.
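altruios's question about expressive tags is partly a text-preprocessing question: the pipeline has to split tagged text into (style, text) spans before synthesis. A hypothetical sketch of that split; the inline `[tag]...[/tag]` syntax is an assumption for illustration, not something the KittenTTS docs confirm:

```python
import re

# Hypothetical inline markup, e.g. "[excited]hi[/excited]" (an assumption).
TAG = re.compile(r"\[(\w+)\](.*?)\[/\1\]", re.DOTALL)

def parse_expressive(text: str) -> list[tuple[str, str]]:
    """Split text into (style, text) spans; untagged text is 'neutral'."""
    spans, pos = [], 0
    for m in TAG.finditer(text):
        if m.start() > pos:
            spans.append(("neutral", text[pos:m.start()]))
        spans.append((m.group(1), m.group(2)))  # (tag name, tagged text)
        pos = m.end()
    if pos < len(text):
        spans.append(("neutral", text[pos:]))
    return spans

parse_expressive("Hello [excited]world[/excited]!")
# -> [("neutral", "Hello "), ("excited", "world"), ("neutral", "!")]
```

Each span would then be synthesized with its style and the audio concatenated; whether KittenTTS exposes any per-span style control is exactly what the thread is asking.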
Posted by rohan_joshi
For creatives and voice-tool users, the draw is the promise of small, expressive TTS models with multiple voices that run on-device. The discussion centers on whether the voices sound good, how well prosody works, and whether expressive control is strong enough for production narration or voice apps.