KittenML's latest open-source TTS release spans models from 15M to 80M parameters, with the smallest coming in under 25MB and the larger one reportedly running faster than realtime on CPU. Audio creators should test pronunciation quality and install overhead before betting on it for edge or local voice tools.

Posted by rohan_joshi
Kitten TTS is an open-source, lightweight text-to-speech library built on ONNX. The latest v0.8 release (Feb 2026) offers models from 15M parameters (25MB as int8) to 80M parameters (80MB), running high-quality synthesis on CPU without a GPU. Features include text preprocessing, a Python API (installable via pip wheel), Hugging Face model checkpoints (e.g., kitten-tts-nano-0.8), and a browser demo on HF Spaces. It is Apache-2.0 licensed and currently a developer preview, with commercial support available. Planned next: multilingual TTS and KittenASR.
According to the GitHub page, KittenTTS v0.8 is an open-source ONNX text-to-speech library with model sizes from 15M to 80M parameters. The smallest int8 model is listed at 25MB, while the larger 80M model is framed as high-quality synthesis that can run on CPU without a GPU. For creative tooling, the practical package is the Python API, downloadable Hugging Face checkpoints, and a browser demo linked from the same project page.
Thread discussion highlights:
- deathanatos on dependency bloat / torch CUDA: "It pulls in NVIDIA libs... I literally run out of disk trying to install this on Linux."
- baibai008989 on edge deployment: "the dependency chain issue is a real barrier for edge deployment... 25MB is genuinely exciting for that use case."
- bobokaytop on latency / realtime performance: "running on an intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though."
The strongest creative angle is local voice generation where size and runtime matter more than studio-grade polish. In the discussion roundup, one user reports about 1.5x realtime on an Intel 9700 CPU with the 80M model, while another calls a 25MB model genuinely exciting for edge deployment because dependency chains often block small-device shipping.
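To make the "1.5x realtime" figure concrete: realtime factor is simply seconds of audio produced per second of compute, where anything above 1.0 is fast enough for live playback. A minimal sketch (the function name and the sample numbers are illustrative, not from the project; the ~1.5x figure itself comes from the thread):

```python
def realtime_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Return seconds of audio produced per second of wall-clock compute.

    RTF > 1.0 means synthesis is faster than realtime, i.e. usable for
    streaming or live playback without the audio buffer running dry.
    """
    return audio_seconds / synthesis_seconds

# Illustrative numbers matching the ~1.5x CPU report from the thread:
# 30 seconds of speech synthesized in 20 seconds of compute.
print(realtime_factor(30.0, 20.0))  # → 1.5
```

The GPU comment in the thread (no speedup on a 3080) suggests the 80M model is small enough that CPU inference is not the bottleneck, which is consistent with the edge-deployment pitch.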
The same thread also shows why audio teams should test before committing. A commenter in the main HN thread says Linux installation pulled in enough NVIDIA libraries to become a disk problem, and another reports that number pronunciation degraded into noise. That makes v0.8 more compelling as an experimental local voice layer than a drop-in production narrator.
Relevant for creatives working with voice and audio production: the thread is about expressive text-to-speech, voice quality, prosody, pronunciation, and whether very small models can still produce usable spoken output for apps and media workflows.