Release · March 24, 2026

KittenTTS releases 25MB nano voice model with CPU-only ONNX runtime

KittenTTS 0.8 ships new 15M, 40M and 80M models, including an int8 nano model of around 25MB that runs on CPU with no GPU required. It suits narration, character voices and lightweight assistants that need offline or edge-friendly speech.

Voice Local Inference

TL;DR

KittenTTS 0.8 adds three new open-weight voice models (mini at 80M parameters, micro at 40M, and nano at 15M), and the smallest int8 version lands at about 25MB, according to the repo page.
The release is built around ONNX and CPU inference rather than GPU requirements, which makes it more relevant for offline narration tools, lightweight assistants, and edge-style voice apps, as described in the launch thread.
The creative angle is expressive speech rather than a “tiny demo”: the discussion around the HN post centers on prosody, number pronunciation, and how much control users get over delivery.
Early practitioner feedback in the discussion summary says deployment size is promising, but low-power latency and streaming architecture may still matter more than model footprint alone.

What shipped

Hacker News · page · 560 points · 183 comments

KittenML/KittenTTS

Posted by rohan_joshi

KittenTTS is an open-source, lightweight text-to-speech library built on ONNX, with models from 15M to 80M parameters (25-80 MB) that enable high-quality CPU-based voice synthesis without a GPU. Version 0.8 adds new models: mini (80M), micro (40M), and nano (15M, int8 at 25MB). It includes text preprocessing, a Python API, and a demo on Hugging Face Spaces. The project is Apache 2.0 licensed and labeled a developer preview. Commercial support is available via Stellon Labs.

KittenTTS 0.8 ships as an Apache 2.0, ONNX-based text-to-speech library with a Python API, text preprocessing, and a Hugging Face demo, per the project page. For creators, the key update is the model spread: an 80M mini, 40M micro, and 15M nano, with the smallest int8 build coming in around 25MB. That makes the release unusually compact for voice workflows that need local synthesis instead of cloud calls.

The project positioning in the launch thread also leans toward usable expressive speech, not just bare intelligibility. The stated focus on prosody and pronunciation is what makes this more interesting for narration, character voices, and embedded voice agents than a generic “small TTS model” drop.

What the early caveats look like

Hacker News · discussion · 560 points · 183 comments

Discussion around Show HN: Three new Kitten TTS models – smallest less than 25MB

Posted by rohan_joshi

Thread discussion highlights:
- baibai008989 on edge deployment barrier: "The dependency chain issue is a real barrier for edge deployment... anything that pulls torch + cuda makes the whole thing a non-starter. 25MB is genuinely exciting for that use case."
- bobokaytop on latency tradeoff: "the practical bottleneck for most edge deployments isn't model size -- it's the inference latency on low-power hardware and the audio streaming architecture around it."
- altruios on expressive control: "One of the core features I look for is expressive control... How does it handle expressive tags?"

The first practical read from the Hacker News discussion is that package size solves only part of the problem. In the thread summary, one commenter calls a 25MB model genuinely exciting for edge deployment because it avoids the usual Torch-and-CUDA dependency chain, while another says inference latency on low-power hardware and audio streaming design are still the real bottlenecks.
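The streaming concern can be illustrated with a minimal Python sketch of sentence-level chunking, one common way to start playback before a whole passage has been synthesized. This is not something the thread or the project describes: `split_sentences`, `stream_synthesis`, and the `synthesize` callback are hypothetical names, and the callback stands in for whatever TTS call an application actually uses.

```python
import re
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' followed by whitespace (a naive rule)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def stream_synthesis(
    text: str,
    synthesize: Callable[[str], List[float]],
) -> Iterator[List[float]]:
    """Yield audio one sentence at a time so playback can begin early.

    `synthesize` is a hypothetical callback that maps a sentence to samples.
    """
    for sentence in split_sentences(text):
        yield synthesize(sentence)


if __name__ == "__main__":
    # Placeholder synthesizer: one silent sample per character of input.
    def fake_synthesize(sentence: str) -> List[float]:
        return [0.0] * len(sentence)

    for chunk in stream_synthesis("Hello there. How are you?", fake_synthesize):
        print(len(chunk), "samples ready")
```

With this shape, time to first audio depends on the length of the first sentence rather than the whole text, which is why the architecture around the model can matter as much as its size.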

There is at least one concrete integration datapoint: a commenter cited in the main thread says they wired the repo into Discord voice messages within minutes and saw about 1.5x realtime on an Intel 9700 CPU using the 80M model. The same discussion also raises open questions about expressive tags and fine-grained delivery control, which are still the make-or-break details for creative voice work.
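To make the "about 1.5x realtime" figure concrete, here is a minimal Python sketch of how such a number is commonly measured. It assumes the figure means seconds of audio produced per second of wall-clock time, which the thread does not spell out. The `fake_synthesize` function and the 24,000 Hz sample rate are hypothetical placeholders, not the KittenTTS API or its documented output rate.

```python
import time
from typing import Callable, Sequence


def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute (higher is faster)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def measure(
    synthesize: Callable[[str], Sequence[float]],
    text: str,
    sample_rate: int,
) -> float:
    """Time one synthesis call and return its realtime factor."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    return realtime_factor(len(samples) / sample_rate, elapsed)


def fake_synthesize(text: str) -> list:
    # Hypothetical stand-in for a real TTS call: sleeps briefly, then
    # returns 1200 silent samples per character of input.
    time.sleep(0.01)
    return [0.0] * (len(text) * 1200)


if __name__ == "__main__":
    # 15 seconds of audio generated in 10 seconds of compute.
    print(realtime_factor(15.0, 10.0))
    print(measure(fake_synthesize, "hello", 24000))
```

Under this definition, any value above 1.0 means audio is generated faster than it plays back. Swapping `fake_synthesize` for a real synthesis call would give a comparable number on a given machine.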

🧾 More sources

Hacker News · core · 560 points · 183 comments

Show HN: Three new Kitten TTS models – smallest less than 25MB

Posted by rohan_joshi

For creators, the interesting part is that these tiny models are aimed at usable expressive speech rather than just bare synthesis. The thread focuses on voice quality, prosody, pronunciation of numbers, and how much control users get over expressive delivery, which is useful context for anyone building voice content, narration tools, or character and assistant voices.
