KittenTTS 0.8 ships new 15M, 40M and 80M models, including an int8 nano model around 25MB that runs on CPU without GPU. It is a fit for narration, character voices and lightweight assistants that need offline or edge-friendly speech.

Posted by rohan_joshi
KittenTTS is an open-source, lightweight text-to-speech library built on ONNX with models from 15M to 80M parameters (25-80 MB), enabling high-quality CPU-based voice synthesis without GPU. Version 0.8 released with new models: mini (80M), micro (40M), nano (15M, int8 25MB). Features text preprocessing, Python API, demo on Hugging Face Spaces. Apache 2.0 licensed, developer preview. Commercial support available via Stellon Labs.
KittenTTS 0.8 ships as an Apache 2.0, ONNX-based text-to-speech library with a Python API, text preprocessing, and a Hugging Face demo, per the project page. For creators, the key update is the model spread: an 80M mini, 40M micro, and 15M nano, with the smallest int8 build coming in around 25MB. That makes the release unusually compact for voice workflows that need local synthesis instead of cloud calls.
The project positioning in the launch thread also leans toward usable expressive speech, not just bare intelligibility. The stated focus on prosody and pronunciation is what makes this more interesting for narration, character voices, and embedded voice agents than a generic “small TTS model” drop.
Posted by rohan_joshi
Thread discussion highlights: - baibai008989 on edge deployment barrier: The dependency chain issue is a real barrier for edge deployment... anything that pulls torch + cuda makes the whole thing a non-starter. 25MB is genuinely exciting for that use case. - bobokaytop on latency tradeoff: the practical bottleneck for most edge deployments isn't model size -- it's the inference latency on low-power hardware and the audio streaming architecture around it. - altruios on expressive control: One of the core features I look for is expressive control... How does it handle expressive tags?
The first practical read from the Hacker News discussion is that package size solves only part of the problem. In the thread summary, one commenter calls a 25MB model genuinely exciting for edge deployment because it avoids the usual Torch-and-CUDA dependency chain, while another says inference latency on low-power hardware and audio streaming design are still the real bottlenecks.
There is at least one concrete integration datapoint: a commenter cited in the main thread says they wired the repo into Discord voice messages within minutes and saw about 1.5x realtime on an Intel 9700 CPU using the 80M model. The same discussion also raises open questions about expressive tags and fine-grained delivery control, which are still the make-or-break details for creative voice work.
Posted by rohan_joshi
For creators, the interesting part is that these tiny models are aimed at usable expressive speech rather than just bare synthesis. The thread focuses on voice quality, prosody, pronunciation of numbers, and how much control users get over expressive delivery—useful context for anyone building voice content, narration tools, or character/assistant voices.