The January 2025 upgrade of the 82M-parameter Kokoro open-weight TTS model with expanded multilingual voices.
Kokoro v1.0 is an open-weight text-to-speech model by hexgrad, released on January 27, 2025 as a major upgrade to the Kokoro-82M series. At 82 million parameters (0.082B), it occupies a unique niche: a TTS model small enough to run on consumer hardware, yet capable of output quality that competes with models an order of magnitude larger. It is not a language model; it is a speech synthesis model that converts text input to audio output.
The v1.0 release expanded the model from a single-language, 10-voice system to 9 languages with 54 voices, trained on several hundred hours of data. Total training cost was approximately $1,000 across 1,000 A100 GPU-hours. The model is licensed under Apache 2.0, meaning commercial deployment, modification, and redistribution are permitted without restriction.
Kokoro v1.0 matters because it collapses the cost floor for production TTS. At under $0.06 per hour of audio output when served via API, and zero marginal cost when run locally, it makes high-quality speech synthesis accessible for high-volume, privacy-sensitive, or offline workloads. The download is roughly 300MB, and it runs on CPU, CUDA GPU, or Apple Silicon.
Kokoro v1.0 is a dense model built on a modified StyleTTS 2 architecture, derived from yl4579/StyleTTS2-LJSpeech. All 82 million parameters are active during inference—there are no expert routing mechanisms or sparse activation patterns. This means VRAM consumption is predictable and proportional to the full model size, with no inference-time memory spikes from load balancing.
The model processes text input and generates audio output. A context window in the language-model sense does not apply; input length is bounded by practical audio-generation limits rather than a specified context size. The pipeline requires espeak-ng as a phonemizer dependency for text preprocessing.
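As a sanity check before loading the pipeline, you can verify that the phonemizer is installed. A minimal sketch (depending on version, espeak-ng may be loaded as a shared library rather than invoked as a CLI, but a system package install typically provides both):

```python
import shutil

# Kokoro's text preprocessing depends on espeak-ng for phonemization;
# fail fast with an actionable message if it is missing.
if shutil.which("espeak-ng") is None:
    raise RuntimeError(
        "espeak-ng not found on PATH. Install it first, e.g. "
        "'apt-get install espeak-ng' (Debian/Ubuntu) or "
        "'brew install espeak-ng' (macOS)."
    )
```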
For local deployment, the model ships in multiple formats. The native PyTorch checkpoint is approximately 300MB. Community ONNX builds are available in three variants: FP32 (310MB), FP16 (169MB), and INT8 (88MB). These ONNX exports enable deployment on platforms without PyTorch support, including mobile and browser environments via ONNX Runtime Web.
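One way to consume these exports is the community kokoro-onnx wrapper. The sketch below follows that project's README; the file names and the create signature are that wrapper's conventions, not the upstream repo's, and may change between versions:

```python
import soundfile as sf
from kokoro_onnx import Kokoro  # community wrapper, pip install kokoro-onnx

# Load a pre-downloaded ONNX export plus the packed voice embeddings.
kokoro = Kokoro("kokoro-v1.0.onnx", "voices-v1.0.bin")

# create() returns raw float samples and the sample rate.
samples, sample_rate = kokoro.create(
    "Hello from the ONNX build of Kokoro.",
    voice="af_heart",
    speed=1.0,
    lang="en-us",
)
sf.write("audio.wav", samples, sample_rate)
```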
The design prioritizes inference efficiency over data scale. Despite the small parameter count, the model was trained on only a few hundred hours of audio, modest by modern standards, and achieves its quality through architectural choices rather than sheer data volume. This makes it fast at inference, but it does not exhibit the emergent capabilities seen in larger models trained on thousands of hours of speech.
Kokoro v1.0 supports 9 languages with 54 voices: English (US and UK), Spanish, French, Hindi, Italian, Japanese, Brazilian Portuguese, and Mandarin Chinese. Each voice is a fixed speaker embedding; there is no voice cloning capability and no fine-grained emotion control.
The model excels at straightforward text-to-speech tasks: narrating articles, reading notifications, generating audiobook content, powering voice assistants, and producing audio for accessibility tools. It performs particularly well on English text, where its quality approaches commercial offerings like ElevenLabs for standard reading tasks.
Kokoro v1.0 is not suited for use cases requiring real-time streaming, dynamic emotion modulation, or speaker adaptation. If the job requires generating speech in a new voice not present in the training set, or adjusting prosody on the fly for conversational agents, this is the wrong model. For batch generation of natural-sounding speech from fixed voices, it is among the most cost-effective options available.
The primary advantage of Kokoro v1.0 is that it runs on hardware most practitioners already own. Here is what to expect:
VRAM Requirements
Any GPU with 1GB or more VRAM can run all variants without issue. CPU inference is also practical: the model generates audio at near-real-time speeds on modern x86 processors and Apple Silicon.
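A minimal device-selection sketch, assuming a recent kokoro release whose KPipeline accepts a device argument (older versions choose a device on their own, in which case the keyword may not exist):

```python
import torch
from kokoro import KPipeline

# Pick the best available backend: NVIDIA GPU, Apple Silicon, or CPU.
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

pipeline = KPipeline(lang_code="a", device=device)  # 'a' = American English
```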
Best Quantization for Kokoro v1.0
For most users, the ONNX FP16 variant (169MB) offers the best balance of quality and speed. The INT8 variant (88MB) shows negligible quality degradation for English but may introduce artifacts in tonal languages like Mandarin or Japanese. The FP32 variant provides maximum quality at the cost of larger downloads and higher VRAM usage.
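That guidance can be encoded as a simple helper. This is hypothetical glue code, not part of any Kokoro package, and the file names are illustrative:

```python
def pick_onnx_variant(lang_code: str, prefer_small: bool = False) -> str:
    """Map the quantization guidance above to a file choice.

    Hypothetical helper; file names are illustrative, not canonical.
    Kokoro uses single-letter language codes: 'a'/'b' are American and
    British English, 'z' is Mandarin, 'j' is Japanese.
    """
    if prefer_small and lang_code in {"a", "b"}:
        return "kokoro-v1.0.int8.onnx"  # 88MB; negligible loss for English
    return "kokoro-v1.0.fp16.onnx"      # 169MB; recommended default
```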
Getting Started
The quickest path is to pip install kokoro and use the KPipeline class. The model weights download automatically on first run. For production deployments, pre-download the ONNX variant and load it directly with ONNX Runtime.
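A minimal end-to-end sketch using the documented KPipeline interface (af_heart is one of the published American English voices; soundfile is used here only to write WAV files; output is 24kHz mono):

```python
import soundfile as sf
from kokoro import KPipeline

# lang_code 'a' selects American English; other single-letter codes
# cover the remaining languages (e.g. 'e' Spanish, 'j' Japanese,
# 'z' Mandarin Chinese).
pipeline = KPipeline(lang_code="a")

text = "Kokoro is an open-weight TTS model with 82 million parameters."

# The pipeline splits long input into segments and yields
# (graphemes, phonemes, audio) per segment; audio is 24kHz samples.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"output_{i}.wav", audio, 24000)
```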
Kokoro v1.0 vs. XTTS v2 (467M parameters): XTTS v2 offers voice cloning and multilingual support with a larger model, but requires 2-4GB VRAM and runs significantly slower. Kokoro v1.0 produces comparable or better naturalness for English text while using one-fifth the parameters and one-tenth the VRAM. Choose XTTS v2 if you need voice cloning. Choose Kokoro v1.0 if you need speed, low resource usage, or English-focused quality.
Kokoro v1.0 vs. MetaVoice v1 (1.2B parameters): MetaVoice offers better prosody control and emotional range but requires 4-8GB VRAM and specialized inference servers. Kokoro v1.0 runs on a laptop CPU. For batch generation of neutral narration, Kokoro v1.0 is more practical. For expressive, character-driven speech, MetaVoice is the better choice.
Kokoro v1.0 vs. ElevenLabs API: ElevenLabs offers superior voice quality, voice cloning, and emotional control—but it is a cloud API with per-character pricing and no offline option. Kokoro v1.0 is free, private, and runs anywhere. The quality gap on standard English narration is narrow enough that many production deployments choose Kokoro v1.0 for cost and latency reasons, reserving ElevenLabs for premium or voice-cloning use cases.