An 82M-parameter open-weight English TTS model built on a StyleTTS 2-based architecture.
Kokoro v0.19 is a compact, high-performance text-to-speech (TTS) model developed by hexgrad. At just 82 million parameters, it is designed to bridge the gap between the massive, resource-heavy TTS models used by cloud providers and the lightweight, often robotic-sounding local alternatives. It is built on a StyleTTS 2-based architecture, which allows it to generate natural, human-like English speech with remarkably low latency.
For developers and engineers, Kokoro v0.19 represents a shift toward "edge-first" audio generation. While many TTS models require dedicated server-grade GPUs to achieve real-time factors (RTF) below 1.0, Kokoro v0.19 is small enough to run on almost any modern consumer device. Its Apache 2.0 license further distinguishes it from competitors, providing a truly open-weight solution that can be integrated into commercial applications, local agents, and offline accessibility tools without the burden of restrictive licensing or per-character API costs.
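The real-time factor mentioned above is simply wall-clock synthesis time divided by the duration of the generated audio; anything below 1.0 means speech is produced faster than it plays back. A minimal sketch of the measurement, assuming a synthesis callable that returns raw samples and a 24 kHz output rate:

```python
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    """RTF = wall-clock synthesis time / duration of generated audio.

    `synthesize` is any callable returning a flat sequence of samples;
    the 24 kHz default matches Kokoro's output rate.
    Values below 1.0 mean faster-than-playback generation.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    duration_s = len(audio) / sample_rate
    return elapsed / duration_s
```

The same helper works for any TTS backend, which makes it handy for comparing Kokoro against heavier models on identical hardware.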
The model utilizes a dense architecture with 82 million parameters. It is derived from the StyleTTS 2 framework, specifically leveraging the yl4579/StyleTTS2-LJSpeech base. Unlike traditional autoregressive TTS models that can be slow and prone to "hallucinated" audio artifacts during long sequences, the StyleTTS 2 architecture focuses on style-based latent variables to model the diverse prosody of human speech.
Kokoro v0.19 excels in environments where low latency is the primary requirement. Because the model is small enough to fit entirely in the L3 cache of some high-end CPUs or the dedicated VRAM of entry-level GPUs, it is well suited to real-time voice assistants, local agents, and offline accessibility tools.
In this v0.19 release, the model supports 10 distinct voices. While it lacks the advanced "emotional steering" found in massive models, the output is notably less "grainy" than that of older systems like Coqui TTS or eSpeak NG.
Running Kokoro v0.19 locally is trivial compared to running LLMs. Its hardware requirements are among the lowest in the current AI ecosystem, making it accessible to almost any practitioner.
To run Kokoro v0.19, you do not need a flagship GPU. The model weights in FP16 take up less than 200MB of space.
While quantization (such as Q4_K_M or Q8_0) is standard practice for LLMs, it is often unnecessary for Kokoro v0.19. Because the model is only 82M parameters, the memory savings from 4-bit quantization are negligible (reducing a ~170MB file to ~50MB), while the risk of "robotic" artifacts in the audio increases. It is recommended to run this model in FP16 or BF16 to maintain the highest vocal fidelity.
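The file sizes above follow directly from back-of-envelope arithmetic: bytes per parameter times parameter count, ignoring container overhead and quantization scale metadata. A quick sketch:

```python
PARAMS = 82_000_000  # parameter count reported for Kokoro v0.19

# Approximate storage cost per parameter; quantized formats carry a
# little extra metadata (scales, zero points) not counted here.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_footprint_mb(precision: str, params: int = PARAMS) -> float:
    """Rough weight size in MB (1 MB = 1e6 bytes) for a dense model."""
    return params * BYTES_PER_PARAM[precision] / 1e6

for p in ("fp16", "q4"):
    print(f"{p}: ~{weight_footprint_mb(p):.0f} MB")
```

At FP16 this gives roughly 164 MB, matching the "under 200MB" figure; at 4-bit it is about 41 MB, so the absolute savings are tiny compared to what quantization buys on multi-gigabyte LLMs.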
The most efficient way to get started is via the official kokoro library. It requires espeak-ng as a dependency for G2P (Grapheme-to-Phoneme) conversion. For those who prefer a containerized or managed environment, Kokoro is increasingly supported in local inference engines like Ollama, though the standalone Python implementation remains the gold standard for low-latency integration.
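The exact synthesis API varies between releases of the kokoro library, so consult its README for the current call signature. The plumbing around it is stable, though: the model emits raw float samples that you persist as 16-bit PCM. A stdlib-only sketch of that last step (the `save_speech` helper and its defaults are illustrative, not part of the library):

```python
import struct
import wave

def save_speech(samples, path, sample_rate=24000):
    """Persist float samples in [-1, 1] as a 16-bit mono WAV file.

    `samples` is whatever your TTS synthesis call returns; the 24 kHz
    default matches Kokoro's output rate.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono output
        wf.setsampwidth(2)      # 16-bit PCM
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)
```

Keeping this layer dependency-free means the only external requirements remain the model weights and espeak-ng.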
When evaluating Kokoro v0.19, it is best compared against other small-scale TTS models rather than 7B+ parameter multi-modal models.
For practitioners looking to move away from $15/month TTS subscriptions, Kokoro v0.19 is the most logical entry point for local, high-fidelity speech synthesis. Its combination of a small 82M-parameter footprint and high-quality English output makes it the current benchmark for edge-deployed TTS.