The January 2025 upgrade of the 82M-parameter Kokoro open-weight TTS model with expanded multilingual voices.
Kokoro v1.0 is an open-weight text-to-speech model by hexgrad, released on January 27, 2025 as a major upgrade to the Kokoro-82M series. At 82 million parameters (0.082B), it occupies a unique niche: a TTS model small enough to run on consumer hardware, yet capable of output quality that competes with models an order of magnitude larger. It is not a language model; it is a speech synthesis model that converts text input to audio output.
The v1.0 release expanded the model from a single-language, 10-voice system to 9 languages with 54 voices, trained on several hundred hours of data. Total training cost was approximately $1,000 across 1,000 A100 GPU-hours. The model is licensed under Apache 2.0, meaning commercial deployment, modification, and redistribution are permitted without restriction.
Kokoro v1.0 matters because it collapses the cost floor for production TTS. At under $0.06 per hour of audio output when served via API, and zero marginal cost when run locally, it makes high-quality speech synthesis accessible for high-volume, privacy-sensitive, or offline workloads. The download is roughly 300MB, and it runs on CPU, CUDA GPU, or Apple Silicon.
Kokoro v1.0 is a dense model built on a modified StyleTTS 2 architecture, derived from yl4579/StyleTTS2-LJSpeech. All 82 million parameters are active during inference—there are no expert routing mechanisms or sparse activation patterns. This means VRAM consumption is predictable and proportional to the full model size, with no inference-time memory spikes from load balancing.
The model processes text input and generates audio output. A context window in the language-model sense does not apply; input length is bounded by practical audio-generation limits rather than a specified context size. The pipeline requires espeak-ng as a phonemizer dependency for text preprocessing.
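As a sanity check before loading the pipeline, you can verify that the phonemizer is installed. A minimal sketch (depending on version, espeak-ng may be loaded as a shared library rather than invoked as a CLI, but a system package install typically provides both):

```python
import shutil

# Kokoro's text preprocessing depends on espeak-ng for phonemization;
# fail fast with an actionable message if it is missing.
if shutil.which("espeak-ng") is None:
    raise RuntimeError(
        "espeak-ng not found on PATH. Install it first, e.g. "
        "'apt-get install espeak-ng' (Debian/Ubuntu) or "
        "'brew install espeak-ng' (macOS)."
    )
```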
For local deployment, the model ships in multiple formats. The native PyTorch checkpoint is approximately 300MB. Community ONNX builds are available in three variants: FP32 (310MB), FP16 (169MB), and INT8 (88MB). These ONNX exports enable deployment on platforms without PyTorch support, including mobile and browser environments via ONNX Runtime Web.
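One way to consume these exports is the community kokoro-onnx wrapper. The sketch below follows that project's README; the file names and the create signature are that wrapper's conventions, not the upstream repo's, and may change between versions:

```python
import soundfile as sf
from kokoro_onnx import Kokoro  # community wrapper, pip install kokoro-onnx

# Load a pre-downloaded ONNX export plus the packed voice embeddings.
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

# create() returns raw float samples and the sample rate.
samples, sample_rate = kokoro.create(
    "Hello from the ONNX build of Kokoro.",
    voice="af_heart",
    speed=1.0,
    lang="en-us",
)
sf.write("audio.wav", samples, sample_rate)
```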
The design prioritizes inference efficiency over data scale. Despite the small parameter count, the model was trained on only a few hundred hours of audio, modest by modern standards, and achieves its quality through architectural choices rather than sheer data volume. This makes it fast at inference, but it does not exhibit the emergent capabilities seen in larger models trained on thousands of hours of speech.
Kokoro v1.0 supports 9 languages with 54 voices: English (US and UK), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin Chinese. Each voice is a fixed speaker embedding; there is no voice cloning capability and no fine-grained emotion control.
The model excels at straightforward text-to-speech tasks: narrating articles, reading notifications, generating audiobook content, powering voice assistants, and producing audio for accessibility tools. It performs particularly well on English text, where its quality approaches commercial offerings like ElevenLabs for standard reading tasks.
Kokoro v1.0 is not suited for use cases requiring real-time streaming, dynamic emotion modulation, or speaker adaptation. If the job requires generating speech in a new voice not present in the training set, or adjusting prosody on the fly for conversational agents, this is the wrong model. For batch generation of natural-sounding speech from fixed voices, it is among the most cost-effective options available.
The primary advantage of Kokoro v1.0 is that it runs on hardware most practitioners already own. Here is what to expect:
VRAM Requirements
Any GPU with 1GB or more VRAM can run all variants without issue. CPU inference is also practical: the model generates audio at near-real-time speeds on modern x86 processors and Apple Silicon.
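A minimal device-selection sketch, assuming a recent kokoro release whose KPipeline accepts a device argument (older versions choose a device on their own, in which case the keyword may not exist):

```python
import torch
from kokoro import KPipeline

# Pick the best available backend: NVIDIA GPU, Apple Silicon, or CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

pipeline = KPipeline(lang_code="a", device=device)  # 'a' = American English
```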
Best Quantization for Kokoro v1.0
For most users, the ONNX FP16 variant (169MB) offers the best balance of quality and speed. The INT8 variant (88MB) shows negligible quality degradation for English but may introduce artifacts in tonal languages like Mandarin or Japanese. The FP32 variant provides maximum quality at the cost of larger downloads and higher VRAM usage.
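That guidance can be encoded as a simple helper. This is hypothetical glue code, not part of any Kokoro package, and the file names are illustrative:

```python
def pick_onnx_variant(lang_code: str, prefer_small: bool = False) -> str:
    """Map the quantization guidance above to a file choice.

    Hypothetical helper; file names are illustrative, not canonical.
    Kokoro uses single-letter language codes: 'a'/'b' are American and
    British English, 'z' is Mandarin, 'j' is Japanese.
    """
    if prefer_small and lang_code in {"a", "b"}:
        return "kokoro-v1.0.int8.onnx"  # 88MB; negligible loss for English
    return "kokoro-v1.0.fp16.onnx"      # 169MB; recommended default
```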
Getting Started
The quickest path is to pip install kokoro and use the KPipeline class. The model weights download automatically on first run. For production deployments, pre-download the ONNX variant and load it directly with ONNX Runtime.
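A minimal end-to-end sketch using the documented KPipeline interface (af_heart is one of the published American English voices; soundfile is used here only to write WAV files; output is 24kHz mono):

```python
import soundfile as sf
from kokoro import KPipeline

# lang_code 'a' selects American English; other single-letter codes
# cover the remaining languages (e.g. 'e' Spanish, 'j' Japanese,
# 'z' Mandarin Chinese).
pipeline = KPipeline(lang_code="a")

text = "Kokoro is an open-weight TTS model with 82 million parameters."

# The pipeline splits long input into segments and yields
# (graphemes, phonemes, audio) per segment; audio is 24kHz samples.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"output_{i}.wav", audio, 24000)
```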
Kokoro v1.0 vs. XTTS v2 (467M parameters): XTTS v2 offers voice cloning and multilingual support with a larger model, but requires 2-4GB VRAM and runs significantly slower. Kokoro v1.0 produces comparable or better naturalness for English text while using one-fifth the parameters and one-tenth the VRAM. Choose XTTS v2 if you need voice cloning. Choose Kokoro v1.0 if you need speed, low resource usage, or English-focused quality.
Kokoro v1.0 vs. MetaVoice v1 (1.2B parameters): MetaVoice offers better prosody control and emotional range but requires 4-8GB VRAM and specialized inference servers. Kokoro v1.0 runs on a laptop CPU. For batch generation of neutral narration, Kokoro v1.0 is more practical. For expressive, character-driven speech, MetaVoice is the better choice.
Kokoro v1.0 vs. ElevenLabs API: ElevenLabs offers superior voice quality, voice cloning, and emotional control—but it is a cloud API with per-character pricing and no offline option. Kokoro v1.0 is free, private, and runs anywhere. The quality gap on standard English narration is narrow enough that many production deployments choose Kokoro v1.0 for cost and latency reasons, reserving ElevenLabs for premium or voice-cloning use cases.