KittenML's latest open-source TTS release spans models from 15M to 80M parameters, with the smallest coming in around 25MB and the largest reportedly running faster than realtime on CPU. Audio creators should test pronunciation and install overhead before betting on it for edge or local voice tools.

Posted by rohan_joshi
Kitten TTS is an open-source, lightweight text-to-speech library built on ONNX. The latest v0.8 release (Feb 2026) offers models from 15M parameters (25MB at int8) to 80M parameters (80MB), running high-quality synthesis on CPU without a GPU. It ships text preprocessing, a pip-installable Python API, Hugging Face models (e.g., kitten-tts-nano-0.8), and a browser demo on HF Spaces. The project is Apache-2.0 licensed and labeled a developer preview, with commercial support available; the roadmap lists multilingual TTS and KittenASR.
For creative tooling, the practical package is the Python API, the downloadable Hugging Face checkpoints, and the browser demo linked from the same GitHub page.
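If you want to kick the tires locally, here is a minimal synthesis sketch assuming the KittenTTS class and generate() call match the project README; the exact model id (extrapolated from the kitten-tts-nano-0.8 checkpoint named above), the voice name, and the 24 kHz output rate are assumptions to verify against the current docs:

```python
# Minimal local synthesis sketch -- names below are assumptions, check the README.
# pip install kittentts soundfile
from kittentts import KittenTTS
import soundfile as sf

# Model id extrapolated from the kitten-tts-nano-0.8 checkpoint mentioned above.
tts = KittenTTS("KittenML/kitten-tts-nano-0.8")

# Voice name is an assumption; the docs list the available voices.
audio = tts.generate(
    "Numbers like 1,234.56 are a reported weak spot, so test them early.",
    voice="expr-voice-2-f",
)
sf.write("output.wav", audio, 24000)  # assumed 24 kHz mono float waveform
```

Feeding it number-heavy text, as in the sketch, doubles as a quick check on the pronunciation complaint raised in the thread below.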
Thread discussion highlights:
- deathanatos on dependency bloat / torch CUDA: "It pulls in NVIDIA libs... I literally run out of disk trying to install this on Linux."
- baibai008989 on edge deployment: "the dependency chain issue is a real barrier for edge deployment... 25MB is genuinely exciting for that use case."
- bobokaytop on latency / realtime performance: "running on an intel 9700 CPU, it's about 1.5x realtime using the 80M model. It wasn't any faster running on a 3080 GPU though."
The strongest creative angle is local voice generation, where size and runtime matter more than studio-grade polish. In the discussion roundup above, one user reports about 1.5x realtime on an Intel 9700 CPU with the 80M model, while another calls the 25MB model genuinely exciting for edge deployment because dependency chains often block shipping on small devices.
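Realtime factor is easy to measure on your own hardware rather than trusting a thread anecdote: divide the seconds of audio produced by the wall-clock seconds spent producing it. A minimal sketch, reusing the assumed API from the earlier example (class name, generate() signature, and 24 kHz rate are all assumptions):

```python
# Realtime-factor (RTF) sketch: seconds of audio produced per second of compute.
# Reuses the assumed KittenTTS API from the earlier example.
import time
from kittentts import KittenTTS

tts = KittenTTS("KittenML/kitten-tts-nano-0.8")  # assumed model id
text = "A medium-length sentence gives a steadier timing sample than a short one."

start = time.perf_counter()
audio = tts.generate(text)  # assumed to return a 1-D float waveform array
elapsed = time.perf_counter() - start

sample_rate = 24000  # assumed output rate; confirm in the docs
audio_seconds = len(audio) / sample_rate
print(f"RTF: {audio_seconds / elapsed:.2f}x realtime "
      f"({audio_seconds:.2f}s of audio in {elapsed:.2f}s)")
```

An RTF above 1.0 means synthesis outpaces playback, which is the bar the 1.5x report clears on CPU.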
The same thread also shows why audio teams should test before committing. One commenter says a Linux install pulled in enough NVIDIA libraries to become a disk problem, and another reports number pronunciation degrading into noise. That makes v0.8 more compelling as an experimental local voice layer than as a drop-in production narrator.
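The disk complaint is checkable in a few lines: since the library runs on ONNX, you can confirm whether you got a CPU-only onnxruntime or a CUDA build that drags in NVIDIA libraries. The provider query below is onnxruntime's real API; whether Kitten TTS pins onnxruntime or onnxruntime-gpu in its own dependencies is an assumption to verify in its setup metadata:

```python
# Check which ONNX Runtime build got installed -- a CUDA build pulls in large
# NVIDIA libraries, which is the disk complaint from the thread.
import onnxruntime as ort

providers = ort.get_available_providers()
print("ONNX Runtime providers:", providers)

if "CUDAExecutionProvider" in providers:
    print("CUDA build detected; for CPU-only use, installing plain onnxruntime "
          "(not onnxruntime-gpu) avoids the NVIDIA dependency chain.")
else:
    print("CPU-only build: no NVIDIA libraries required.")
```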
Relevant for creatives working with voice and audio production: the thread is about expressive text-to-speech, voice quality, prosody, pronunciation, and whether very small models can still produce usable spoken output for apps and media workflows.
Tongyi Lab open-sourced Fun-CineForge with multi-speaker dubbing, a temporal modality for off-screen or occluded faces, and a full dataset-building pipeline. It matters for dialogue and localization workflows that break on hard cuts, overlapping speech, or missing lip cues.
release: Topview added Seedance 2.0 to Agent V2, pairing multi-scene generation with a storyboard timeline and Business Annual access billed as 365 days of unlimited generations. That moves long-form video workflows toward editable sequences instead of stitched clips.
workflow: Creators are moving from V8 calibration complaints to darker film-still scenes, fashion shots, and worldbuilding tests, with ECLIPTIC remakes showing stronger depth and lighting. Retest saved SREF recipes if you rely on V8 for cinematic ideation.
workflow: A shared workflow converts GTA-style stills into photoreal images with Nano Banana 2, then animates them in LTX-2.3 Pro 4K using detailed material, skin, vehicle, and camera prompts. Try it for trailer-style previsualization if you want more control at lower cost.
workflow: Shared Nano Banana 2 workflows now cover turnaround sheets, distinctive facial traits, and photoreal rerenders that keep the framing of a reference image. Use one prompt grammar for concept art, editorial portraits, and animation prep.