An 82M-parameter open-weight English TTS model built on a StyleTTS 2-based architecture.
Kokoro v0.19 is a compact, high-performance text-to-speech (TTS) model developed by hexgrad. At just 82 million parameters, it is designed to bridge the gap between the massive, resource-heavy TTS models used by cloud providers and the lightweight, often robotic-sounding local alternatives. It is built on a StyleTTS 2-based architecture, which allows it to generate natural, human-like English speech with remarkably low latency.
For developers and engineers, Kokoro v0.19 represents a shift toward "edge-first" audio generation. While many TTS models require dedicated server-grade GPUs to achieve real-time factors (RTF) below 1.0, Kokoro v0.19 is small enough to run on almost any modern consumer device. Its Apache 2.0 license further distinguishes it from competitors, providing a truly open-weight solution that can be integrated into commercial applications, local agents, and offline accessibility tools without the burden of restrictive licensing or per-character API costs.
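The real-time factor mentioned above is simply wall-clock synthesis time divided by the duration of the generated audio; anything below 1.0 means speech is produced faster than it plays back. A minimal sketch of the measurement, assuming a synthesis callable that returns raw samples and a 24 kHz output rate:

```python
import time

def real_time_factor(synthesize, text, sample_rate=24000):
    """RTF = wall-clock synthesis time / duration of generated audio.

    `synthesize` is any callable returning a flat sequence of samples;
    the 24 kHz default matches Kokoro's output rate.
    Values below 1.0 mean faster-than-playback generation.
    """
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed = time.perf_counter() - start
    duration_s = len(audio) / sample_rate
    return elapsed / duration_s
```

The same helper works for any TTS backend, which makes it handy for comparing Kokoro against heavier models on identical hardware.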
The model utilizes a dense architecture with 82 million parameters. It is derived from the StyleTTS 2 framework, specifically leveraging the yl4579/StyleTTS2-LJSpeech base. Unlike traditional autoregressive TTS models that can be slow and prone to "hallucinated" audio artifacts during long sequences, the StyleTTS 2 architecture focuses on style-based latent variables to model the diverse prosody of human speech.
Kokoro v0.19 excels in environments where low latency is the primary requirement. Because the model is small enough to fit entirely in the L3 cache of some high-end CPUs or the dedicated VRAM of entry-level GPUs, it is well suited to real-time voice assistants, local agents, and offline accessibility tools.
In this v0.19 release, the model supports 10 distinct voices. While it lacks the advanced "emotional steering" found in massive models, the output is notably less "grainy" than that of older systems like Coqui TTS or eSpeak NG.
Running Kokoro v0.19 locally is trivial compared to running LLMs. Its hardware requirements are among the lowest in the current AI ecosystem, making it accessible to almost any practitioner.
To run Kokoro v0.19, you do not need a flagship GPU. The model weights in FP16 take up less than 200MB of space.
While quantization (such as Q4_K_M or Q8_0) is standard practice for LLMs, it is often unnecessary for Kokoro v0.19. Because the model is only 82M parameters, the memory savings from 4-bit quantization are negligible (reducing a ~170MB file to ~50MB), while the risk of "robotic" artifacts in the audio increases. It is recommended to run this model in FP16 or BF16 to maintain the highest vocal fidelity.
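The file sizes above follow directly from back-of-envelope arithmetic: bytes per parameter times parameter count, ignoring container overhead and quantization scale metadata. A quick sketch:

```python
PARAMS = 82_000_000  # parameter count reported for Kokoro v0.19

# Approximate storage cost per parameter; quantized formats carry a
# little extra metadata (scales, zero points) not counted here.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "q8": 1.0, "q4": 0.5}

def weight_footprint_mb(precision: str, params: int = PARAMS) -> float:
    """Rough weight size in MB (1 MB = 1e6 bytes) for a dense model."""
    return params * BYTES_PER_PARAM[precision] / 1e6

for p in ("fp16", "q4"):
    print(f"{p}: ~{weight_footprint_mb(p):.0f} MB")
```

At FP16 this gives roughly 164 MB, matching the "under 200MB" figure; at 4-bit it is about 41 MB, so the absolute savings are tiny compared to what quantization buys on multi-gigabyte LLMs.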
The most efficient way to get started is via the official kokoro library. It requires espeak-ng as a dependency for G2P (Grapheme-to-Phoneme) conversion. For those who prefer a containerized or managed environment, Kokoro is increasingly supported in local inference engines like Ollama, though the standalone Python implementation remains the gold standard for low-latency integration.
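The exact synthesis API varies between releases of the kokoro library, so consult its README for the current call signature. The plumbing around it is stable, though: the model emits raw float samples that you persist as 16-bit PCM. A stdlib-only sketch of that last step (the `save_speech` helper and its defaults are illustrative, not part of the library):

```python
import struct
import wave

def save_speech(samples, path, sample_rate=24000):
    """Persist float samples in [-1, 1] as a 16-bit mono WAV file.

    `samples` is whatever your TTS synthesis call returns; the 24 kHz
    default matches Kokoro's output rate.
    """
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)      # mono output
        wf.setsampwidth(2)      # 16-bit PCM
        wf.setframerate(sample_rate)
        frames = b"".join(
            struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767))
            for s in samples
        )
        wf.writeframes(frames)
```

Keeping this layer dependency-free means the only external requirements remain the model weights and espeak-ng.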
When evaluating Kokoro v0.19, it is best compared against other small-scale TTS models rather than 7B+ parameter multi-modal models.
For practitioners looking to move away from $15/month TTS subscriptions, Kokoro v0.19 is the most logical entry point for local, high-fidelity speech synthesis. Its combination of a small 82M-parameter footprint and high-quality English output makes it the current benchmark for edge-deployed TTS.