KittenTTS released nano, micro, and mini ONNX TTS models sized for CPU-first deployment instead of GPU-heavy stacks. Voice-agent builders should benchmark both dependency weight and real-time latency before treating small model size alone as a deployment win.

Posted by rohan_joshi
Kitten TTS is an open-source, lightweight text-to-speech library built on ONNX with models from 15M to 80M parameters (25-80 MB), enabling high-quality CPU-based voice synthesis without GPU. Latest v0.8 release includes nano (15M/25-56MB), micro (40M/41MB), and mini (80M/80MB) models on Hugging Face. Features text preprocessing, basic Python API (pip install from GitHub release), demo on HF Spaces, and commercial support. Apache 2.0 licensed, developer preview.
KittenTTS is shipping as an Apache 2.0 open-source library built on ONNX, with three published model sizes in the current v0.8 release: nano at 15M parameters, micro at 40M, and mini at 80M (repo summary). The project description says those models land in a roughly 25MB-to-80MB footprint range and are meant for "CPU-based voice synthesis without GPU," which puts them closer to embedded or local-agent deployments than to conventional GPU-backed speech stacks (GitHub repo).
The release also includes text preprocessing, a basic Python API, Hugging Face-hosted models, and a demo surface, but the repo labels the package a developer preview rather than a finished production runtime (repo summary). That matters because the main novelty here is not just another TTS checkpoint; it is a small-footprint ONNX packaging choice aimed at teams that need voice output where GPU access is expensive, unavailable, or operationally awkward.
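The parameter counts and file sizes roughly track quantization choices: 15M fp32 weights alone would serialize to about 60MB, so the smaller published footprints imply quantized exports. A back-of-envelope sketch (illustrative only; it ignores ONNX graph and metadata overhead and whatever mixed-precision layout the actual exports use):

```python
# Rough on-disk size estimate for model weights, by per-weight dtype.
# Illustrative arithmetic only: ignores ONNX graph structure, metadata,
# and any mixed-precision layout the real KittenTTS exports may use.

def approx_size_mb(n_params: int, bytes_per_param: float) -> float:
    """Approximate serialized weight size in megabytes."""
    return n_params * bytes_per_param / 1e6

for name, params in [("nano", 15_000_000), ("micro", 40_000_000), ("mini", 80_000_000)]:
    fp32 = approx_size_mb(params, 4)   # full precision
    fp16 = approx_size_mb(params, 2)   # half precision
    int8 = approx_size_mb(params, 1)   # 8-bit quantized
    print(f"{name}: fp32 ~{fp32:.0f}MB, fp16 ~{fp16:.0f}MB, int8 ~{int8:.0f}MB")
```

By this arithmetic, mini at 80M parameters in roughly 80MB works out to about one byte per weight, consistent with an int8-style quantized export.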
Posted by rohan_joshi
The main engineering angle is deployability: tiny ONNX-based TTS models that can run CPU-only on edge hardware, but with real-world concerns around Python dependency size, Torch/CUDA leakage, latency, streaming, and API ergonomics. The thread is useful if you build voice agents or offline inference stacks.
The Hacker News thread immediately focused on the real bottleneck for voice agents: deployment ergonomics rather than raw model weights. In the HN summary, the core concerns were Python dependency size, Torch/CUDA leakage, latency, streaming support, and API shape — the parts that decide whether a small model actually stays small inside a shipping application.
Posted by rohan_joshi
Thread discussion highlights:

- dawdler-purge on dependency bloat and CPU-only installs: "anything that pulls torch + cuda makes the whole thing a non-starter."
- baibai008989 on edge deployment: "the dependency chain issue is a real barrier for edge deployment... 25MB is genuinely exciting for that use case."
- bobokaytop on latency and performance: "Running on an intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though."
The most concrete practitioner quote in the discussion recap says "anything that pulls torch + cuda makes the whole thing a non-starter," while another commenter said 25MB is "genuinely exciting" for edge use. That split captures the practical test for this release: a tiny ONNX checkpoint only changes deployment economics if the surrounding install and runtime stay equally lean.
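One way to make that test concrete: after a clean `pip install`, scan the resulting environment for heavyweight GPU-stack packages. A minimal sketch using only the standard library (the prefix blocklist names common offenders; whether KittenTTS actually pulls any of them is exactly what you would be checking):

```python
# Check a freshly built virtualenv for GPU-stack dependency leakage.
# The prefixes below cover common heavyweight packages (torch, CUDA
# wheels, triton); this is a generic check, not specific to KittenTTS.
from importlib import metadata

HEAVY_PREFIXES = ("torch", "nvidia-", "cuda", "triton")

def find_heavy_deps(installed_names):
    """Return installed package names matching known heavyweight prefixes."""
    return sorted(
        name for name in installed_names
        if name.lower().startswith(HEAVY_PREFIXES)
    )

if __name__ == "__main__":
    installed = [d.metadata["Name"] or "" for d in metadata.distributions()]
    leaked = find_heavy_deps(installed)
    if leaked:
        print("GPU-stack packages found:")
        for name in leaked:
            print(" -", name)
    else:
        print("Environment looks CPU-lean.")
```

Running this inside a throwaway virtualenv right after installing the library gives a yes/no answer to the "torch + cuda non-starter" concern before any latency testing begins.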
Performance data is still anecdotal. The same discussion recap cites one report of the 80M model running at about 1.5x realtime on an Intel 9700 CPU and "wasn't any faster" on a 3080 GPU. For engineers building offline assistants or embedded voice agents, that makes KittenTTS interesting less as a benchmark winner than as a CPU-first packaging experiment with enough early signal to justify local latency and dependency testing.
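To reproduce that kind of number locally, the metric to track is the realtime factor: seconds of audio produced per second of wall-clock synthesis. A sketch of the harness (the `synthesize` callable is a stand-in for whatever API the library exposes; only the timing arithmetic is load-bearing):

```python
# Measure the realtime factor (RTF) of a TTS synthesis call.
# Convention used here: RTF = audio_duration / wall_time, so
# "1.5x realtime" means 1.5 seconds of audio per second of compute.
import time

def realtime_factor(wall_seconds: float, n_samples: int, sample_rate: int) -> float:
    """Audio seconds produced per wall-clock second (higher is faster)."""
    audio_seconds = n_samples / sample_rate
    return audio_seconds / wall_seconds

def benchmark(synthesize, text: str, sample_rate: int) -> float:
    """Time one synthesis call and return its realtime factor.

    `synthesize` is a placeholder for the library's API: assumed to
    take text and return a 1-D sequence of audio samples.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    wall = time.perf_counter() - start
    return realtime_factor(wall, len(samples), sample_rate)
```

When benchmarking ONNX models this way, do a throwaway warm-up call first: ONNX Runtime's first inference includes graph optimization, so cold-start numbers overstate steady-state latency.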