An 82M-parameter open-weight English TTS model built on a StyleTTS 2-based architecture.
Kokoro v0.19 is a compact, high-performance text-to-speech (TTS) model developed by hexgrad. At just 82 million parameters, it is designed to bridge the gap between the massive, resource-heavy TTS models used by cloud providers and the lightweight, often robotic-sounding local alternatives. It is built on a StyleTTS 2-based architecture, which allows it to generate natural, human-like English speech with remarkably low latency.
For developers and engineers, Kokoro v0.19 represents a shift toward "edge-first" audio generation. While many TTS models require dedicated server-grade GPUs to achieve real-time factors (RTF) below 1.0, Kokoro v0.19 is small enough to run on almost any modern consumer device. Its Apache 2.0 license further distinguishes it from competitors, providing a truly open-weight solution that can be integrated into commercial applications, local agents, and offline accessibility tools without the burden of restrictive licensing or per-character API costs.
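Real-time factor is simply wall-clock synthesis time divided by the duration of the audio produced; values below 1.0 mean speech is generated faster than it plays back. A quick illustrative calculation (the timings below are hypothetical, not measured benchmarks):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio.

    RTF < 1.0 means faster-than-real-time synthesis.
    """
    return synthesis_seconds / audio_seconds

# Hypothetical example: 0.8 s of compute to synthesize a 10 s clip.
rtf = real_time_factor(0.8, 10.0)
print(f"RTF = {rtf:.2f}")  # 0.08 -> comfortably below real time
```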
The model utilizes a dense architecture with 82 million parameters. It is derived from the StyleTTS 2 framework, specifically leveraging the yl4579/StyleTTS2-LJSpeech base. Unlike traditional autoregressive TTS models that can be slow and prone to "hallucinated" audio artifacts during long sequences, the StyleTTS 2 architecture focuses on style-based latent variables to model the diverse prosody of human speech.
Because it is a dense model rather than a Mixture of Experts (MoE), the VRAM footprint is static and predictable. Every parameter is active during the inference pass, which, at this scale, results in exceptionally high throughput. The model is natively text-only in terms of input modality and is optimized for the English language in this specific version. While the context length is not explicitly capped in the same way as a Large Language Model (LLM), performance and stability are best maintained by processing text in sentence-level or paragraph-level chunks, which the official kokoro Python library handles via internal phonemization.
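The chunking idea can be approximated with a naive sentence splitter. This sketch is not the library's actual phonemizer pipeline (which works on phonemes, not characters); it only illustrates packing text into sentence-level pieces that would each be synthesized independently:

```python
import re

def sentence_chunks(text: str, max_chars: int = 400) -> list[str]:
    """Split on sentence boundaries, then greedily pack sentences into
    chunks no longer than max_chars, keeping prosodic units intact."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

text = "Kokoro is small. It runs on modest hardware. " * 20
for chunk in sentence_chunks(text):
    pass  # each chunk would be passed to the TTS model in turn
```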
Kokoro v0.19 excels in environments where low latency is the primary requirement. Because the model is small enough to fit entirely in the L3 cache of some high-end CPUs or the dedicated VRAM of entry-level GPUs, it is ideal for local agents, real-time assistants, and offline accessibility tools, with output rendered directly to standard .wav files.

In this v0.19 release, the model supports 10 distinct voices. While it lacks the advanced "emotional steering" found in massive models, the output is notably less "grainy" than that of older engines such as Coqui TTS or eSpeak NG.
Running Kokoro v0.19 locally is trivial compared to LLMs. The hardware requirements are among the lowest in the current AI ecosystem, making it accessible for almost any practitioner.
To run Kokoro v0.19, you do not need a flagship GPU. The model weights in FP16 take up less than 200MB of space.
The runtime itself is distributed as the kokoro pip package. While quantization (such as Q4_K_M or Q8_0) is standard practice for LLMs, it is largely unnecessary for Kokoro v0.19. Because the model has only 82M parameters, the memory savings from 4-bit quantization are negligible (reducing a ~170MB FP16 file to roughly 50MB), while the risk of "robotic" artifacts in the audio increases. Running the model in FP16 or BF16 is recommended to preserve the highest vocal fidelity.
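These size figures follow directly from parameter count times bytes per weight; a back-of-the-envelope check (ignoring small overheads such as quantization scale factors and file headers, which is why real quantized files come out slightly larger):

```python
PARAMS = 82_000_000  # Kokoro v0.19 parameter count

def weight_size_mb(params: int, bytes_per_param: float) -> float:
    """Approximate size of the weights alone, in decimal megabytes."""
    return params * bytes_per_param / 1_000_000

print(f"FP16:  {weight_size_mb(PARAMS, 2.0):.0f} MB")  # ~164 MB
print(f"Q8_0:  {weight_size_mb(PARAMS, 1.0):.0f} MB")  # ~82 MB
print(f"4-bit: {weight_size_mb(PARAMS, 0.5):.0f} MB")  # ~41 MB
```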
The most efficient way to get started is via the official kokoro library. It requires espeak-ng as a dependency for G2P (Grapheme-to-Phoneme) conversion. For those who prefer a containerized or managed environment, Kokoro is increasingly supported in local inference engines like Ollama, though the standalone Python implementation remains the gold standard for low-latency integration.
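A minimal usage sketch, assuming the current kokoro Python API: the KPipeline class, the lang_code value, and the voice identifier below reflect recent versions of the pip package and may differ for v0.19-era checkpoints, so treat them as assumptions to verify against the release you download.

```python
def speak(text: str, voice: str = "af_bella", out_path: str = "out.wav") -> None:
    """Synthesize `text` to a 24 kHz mono .wav file.

    Assumes the `kokoro` pip package (KPipeline API), `soundfile`, and
    `espeak-ng` (for grapheme-to-phoneme conversion) are installed.
    The voice name and sample rate are assumptions from the public
    voicepack naming, not guarantees for every release.
    """
    import numpy as np
    import soundfile as sf
    from kokoro import KPipeline  # deferred: triggers model download/load

    pipeline = KPipeline(lang_code="a")  # "a" = American English (assumed)
    # The pipeline yields (graphemes, phonemes, audio) per internal chunk.
    audio_parts = [audio for _, _, audio in pipeline(text, voice=voice)]
    sf.write(out_path, np.concatenate(audio_parts), 24000)
```

Deferring the imports keeps the module importable on machines without the model installed; the first real call pays the one-time weight-loading cost.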
When evaluating Kokoro v0.19, it is best compared against other small-scale TTS models rather than 7B+ parameter multi-modal models.
For practitioners looking to move away from $15/month TTS subscriptions, Kokoro v0.19 is the most logical entry point for local, high-fidelity speech synthesis. Its balance of a tiny 82M-parameter footprint and high-quality English output makes it the current benchmark for edge-deployed TTS.