An 82M-parameter open-weight English TTS model built on a StyleTTS 2-based architecture.
Kokoro v0.19 is a compact, high-performance text-to-speech (TTS) model developed by hexgrad. At just 82 million parameters, it is designed to bridge the gap between the massive, resource-heavy TTS models used by cloud providers and the lightweight, often robotic-sounding local alternatives. It is built on a StyleTTS 2-based architecture, which allows it to generate natural, human-like English speech with remarkably low latency.
For developers and engineers, Kokoro v0.19 represents a shift toward "edge-first" audio generation. While many TTS models require dedicated server-grade GPUs to achieve real-time factors (RTF) below 1.0, Kokoro v0.19 is small enough to run on almost any modern consumer device. Its Apache 2.0 license further distinguishes it from competitors, providing a truly open-weight solution that can be integrated into commercial applications, local agents, and offline accessibility tools without the burden of restrictive licensing or per-character API costs.
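Real-time factor is simply wall-clock synthesis time divided by the duration of the audio produced; values below 1.0 mean speech is generated faster than it plays back. A quick illustrative calculation (the timings below are hypothetical, not measured benchmarks):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.

    RTF < 1.0 means faster-than-real-time synthesis.
    """
    return synthesis_seconds / audio_seconds

# Hypothetical example: 0.8 s of compute to synthesize a 10 s clip.
rtf = real_time_factor(0.8, 10.0)
print(f"RTF = {rtf:.2f}")  # 0.08 -> comfortably below real time
```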
The model utilizes a dense architecture with 82 million parameters. It is derived from the StyleTTS 2 framework, specifically leveraging the yl4579/StyleTTS2-LJSpeech base. Unlike traditional autoregressive TTS models that can be slow and prone to "hallucinated" audio artifacts during long sequences, the StyleTTS 2 architecture focuses on style-based latent variables to model the diverse prosody of human speech.
Because it is a dense model rather than a Mixture of Experts (MoE), the VRAM footprint is static and predictable. Every parameter is active during the inference pass, which, at this scale, results in exceptionally high throughput. The model is natively text-only in terms of input modality and is optimized for the English language in this specific version. While the context length is not explicitly capped in the same way as a Large Language Model (LLM), performance and stability are best maintained by processing text in sentence-level or paragraph-level chunks, which the official kokoro Python library handles via internal phonemization.
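The chunking idea can be approximated with a naive sentence splitter. This sketch is not the library's actual phonemizer pipeline (which works on phonemes, not characters); it only illustrates packing text into sentence-level pieces that would each be synthesized independently:

```python
import re

def sentence_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Split on sentence boundaries, then greedily pack sentences into
    chunks no longer than max_chars, keeping prosodic units intact."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "Kokoro is small. It runs on modest hardware. " * 20
for chunk in sentence_chunks(text):
    pass  # each chunk would be passed to the TTS model in turn
```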
Kokoro v0.19 excels in environments where low latency is the primary requirement. Because the model is small enough to fit entirely in the L3 cache of some high-end CPUs or the dedicated VRAM of entry-level GPUs, it is ideal for local agents, real-time assistants, and offline accessibility tools, with output rendered directly to standard .wav files.

In this v0.19 release, the model supports 10 distinct voices. While it lacks the advanced "emotional steering" found in massive models, the output is notably less "grainy" than that of older engines such as Coqui TTS or eSpeak NG.
Running Kokoro v0.19 locally is trivial compared to LLMs. The hardware requirements are among the lowest in the current AI ecosystem, making it accessible for almost any practitioner.
To run Kokoro v0.19, you do not need a flagship GPU. The model weights in FP16 take up less than 200MB of space.
The runtime itself is distributed as the kokoro pip package. While quantization (such as Q4_K_M or Q8_0) is standard practice for LLMs, it is largely unnecessary for Kokoro v0.19. Because the model has only 82M parameters, the memory savings from 4-bit quantization are negligible (reducing a ~170MB FP16 file to roughly 50MB), while the risk of "robotic" artifacts in the audio increases. Running the model in FP16 or BF16 is recommended to preserve the highest vocal fidelity.
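These size figures follow directly from parameter count times bytes per weight; a back-of-the-envelope check (ignoring small overheads such as quantization scale factors and file headers, which is why real quantized files come out slightly larger):

```python
PARAMS = 82_000_000  # Kokoro v0.19 parameter count

def weight_size_mb(params: int, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal megabytes."""
    return params * bytes_per_param / 1_000_000

print(f"FP16:  {weight_size_mb(PARAMS, 2.0):.0f} MB")  # ~164 MB
print(f"Q8_0:  {weight_size_mb(PARAMS, 1.0):.0f} MB")  # ~82 MB
print(f"4-bit: {weight_size_mb(PARAMS, 0.5):.0f} MB")  # ~41 MB
```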
The most efficient way to get started is via the official kokoro library. It requires espeak-ng as a dependency for G2P (Grapheme-to-Phoneme) conversion. For those who prefer a containerized or managed environment, Kokoro is increasingly supported in local inference engines like Ollama, though the standalone Python implementation remains the gold standard for low-latency integration.
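A minimal usage sketch, assuming the current kokoro Python API: the KPipeline class, the lang_code value, and the voice identifier below reflect recent versions of the pip package and may differ for v0.19-era checkpoints, so treat them as assumptions to verify against the release you download.

```python
def speak(text: str, voice: str = "af_bella", out_path: str = "out.wav") -> None:
    """Synthesize `text` to a 24 kHz mono .wav file.

    Assumes the `kokoro` pip package (KPipeline API), `soundfile`, and
    `espeak-ng` (for grapheme-to-phoneme conversion) are installed.
    The voice name and sample rate are assumptions from the public
    voicepack naming, not guarantees for every release.
    """
    import numpy as np
    import soundfile as sf
    from kokoro import KPipeline  # deferred: triggers model download/load

    pipeline = KPipeline(lang_code="a")  # "a" = American English (assumed)
    # The pipeline yields (graphemes, phonemes, audio) per internal chunk.
    audio_parts = [audio for _, _, audio in pipeline(text, voice=voice)]
    sf.write(out_path, np.concatenate(audio_parts), 24000)
```

Deferring the imports keeps the module importable on machines without the model installed; the first real call pays the one-time weight-loading cost.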
When evaluating Kokoro v0.19, it is best compared against other small-scale TTS models rather than 7B+ parameter multi-modal models.
For practitioners looking to move away from $15/month TTS subscriptions, Kokoro v0.19 is the most logical entry point for local, high-fidelity speech synthesis. Its balance of a tiny 82M-parameter footprint and high-quality English output makes it the current benchmark for edge-deployed TTS.