Versatile instant voice-cloning TTS that replicates a reference speaker's tone color from a short audio clip and supports cross-lingual generation.
OpenVoice is an instant voice cloning text-to-speech (TTS) model developed by MyShell AI in collaboration with MIT and Tsinghua University. It replicates a reference speaker’s tone color from a short audio clip—typically 30 seconds or less—and generates speech in multiple languages while preserving the cloned voice. The model is released under the MIT license, making it free for commercial use.
What sets OpenVoice apart from other TTS models is its separation of tone color from voice style. Most voice cloning systems treat voice as a monolithic attribute. OpenVoice decouples them, giving you granular control over emotion, accent, rhythm, pauses, and intonation independently of the cloned voice. This means you can clone a speaker’s voice and make it sound happy, sad, or accented without losing the original timbre.
The architecture is dense, with an undisclosed parameter count; MyShell AI has not published the figure, which is common for commercial TTS models. In terms of modality, the model takes text plus a short reference audio clip as input and outputs synthesized speech. A maximum input length is not specified, but in practice the model handles sentence-to-paragraph-length inputs without issue.
OpenVoice has been powering MyShell’s instant voice cloning since May 2023, processing tens of millions of voice cloning requests. V2, released in April 2024, improves audio quality and adds native support for English, Spanish, French, Chinese, Japanese, and Korean.
OpenVoice uses a two-stage architecture that separates tone color cloning from style control. The first stage extracts the reference speaker’s tone color from the short audio sample. The second stage generates speech in the target language while applying the extracted tone color and any specified style parameters.
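To make the two stages concrete, here is a minimal sketch of the flow as it appears in the project's V1 demo notebooks. The checkpoint paths, class names, and keyword arguments are taken from that demo and should be treated as assumptions as far as newer releases are concerned; verify them against your checkout of the repository.

```python
# Sketch of the two-stage OpenVoice V1 pipeline (paths and APIs follow the
# demo notebooks in myshell-ai/OpenVoice; treat them as assumptions).
import torch
from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Stage 2 components: a base speaker TTS plus the tone color converter.
base_tts = BaseSpeakerTTS("checkpoints/base_speakers/EN/config.json", device=device)
base_tts.load_ckpt("checkpoints/base_speakers/EN/checkpoint.pth")
converter = ToneColorConverter("checkpoints/converter/config.json", device=device)
converter.load_ckpt("checkpoints/converter/checkpoint.pth")

# Stage 1: extract the reference speaker's tone color embedding once.
target_se, _ = se_extractor.get_se("reference.mp3", converter, vad=True)
source_se = torch.load("checkpoints/base_speakers/EN/en_default_se.pth").to(device)

# Stage 2: synthesize with the base speaker, then swap in the cloned tone color.
base_tts.tts("Hello from a cloned voice.", "tmp.wav", speaker="default", language="English")
converter.convert(audio_src_path="tmp.wav", src_se=source_se,
                  tgt_se=target_se, output_path="cloned.wav")
```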
This separation is computationally efficient. According to the authors, OpenVoice costs “tens of times less” than commercial APIs that deliver inferior performance. The model runs on a single GPU and does not require the massive compute resources typical of end-to-end neural TTS systems.
Because the parameter count is undisclosed, precise VRAM figures are not available from the provider. However, based on real-world usage and community reports, the model runs comfortably on consumer GPUs with 8GB or more VRAM. The V2 checkpoint is slightly larger than V1 due to improved training, but both versions run on the same hardware profile.
The model supports zero-shot cross-lingual voice cloning. This means the reference audio and the generated speech can be in different languages, and neither language needs to be in the training dataset. In practice, this works best with the natively supported languages in V2, but the architecture handles unseen languages with acceptable quality.
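For cross-lingual generation with V2, the base speaker comes from MeloTTS rather than the built-in V1 base speakers. The sketch below (English reference clip, Spanish output) follows the V2 demo's checkpoint layout; the paths, the `spk2id` lookup, and the `ses/es.pth` source embedding are assumptions drawn from those examples.

```python
# Cross-lingual sketch: clone tone color from an English reference,
# generate Spanish speech via a MeloTTS base speaker (V2 layout assumed).
import torch
from melo.api import TTS
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

device = "cuda:0" if torch.cuda.is_available() else "cpu"

converter_v2 = ToneColorConverter("checkpoints_v2/converter/config.json", device=device)
converter_v2.load_ckpt("checkpoints_v2/converter/checkpoint.pth")

# Tone color from an English reference clip...
target_se, _ = se_extractor.get_se("english_reference.mp3", converter_v2, vad=True)

# ...applied to a Spanish base speaker from MeloTTS.
melo = TTS(language="ES", device=device)
melo.tts_to_file("Hola, esto es una prueba de clonación de voz.",
                 melo.hps.data.spk2id["ES"], "tmp_es.wav", speed=1.0)

source_se = torch.load("checkpoints_v2/base_speakers/ses/es.pth", map_location=device)
converter_v2.convert(audio_src_path="tmp_es.wav", src_se=source_se,
                     tgt_se=target_se, output_path="cloned_es.wav")
```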
OpenVoice is purpose-built for instant voice cloning with style control.
The model excels at tone color cloning—it preserves the unique qualities of a speaker’s voice better than many alternatives. The style control is genuinely useful: you can generate the same sentence with the same voice but different emotions or accents, which is difficult with end-to-end models that bake style into the voice embedding.
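Because style lives in the base speaker rather than in the voice embedding, varying the delivery is just a different argument to the base TTS call. A sketch, reusing `base_tts`, `converter`, `target_se`, and `device` from the V1 example above; the style names and the `en_style_se.pth` embedding are taken from the V1 demo and may differ across releases.

```python
# Same text, same cloned voice, three different deliveries.
import torch

# Non-default styles use a separate style source embedding in the V1 demo.
style_se = torch.load("checkpoints/base_speakers/EN/en_style_se.pth").to(device)

for style in ("whispering", "cheerful", "excited"):  # style names per the V1 demo
    base_tts.tts("The quarterly numbers look great.", "tmp.wav",
                 speaker=style, language="English")
    converter.convert(audio_src_path="tmp.wav", src_se=style_se,
                      tgt_se=target_se, output_path=f"cloned_{style}.wav")
```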
Limitations: Audio quality, while improved in V2, does not match premium commercial services like ElevenLabs. The model requires some technical setup—it is not a plug-and-play product. Accent conversion can be inconsistent, particularly with less common language pairs.
OpenVoice runs on consumer hardware. Here is what you need to know for local deployment:
Minimum hardware: Any GPU with 6GB VRAM can run the V1 model at reasonable speed. V2 requires slightly more—8GB VRAM is the practical minimum for real-time or near-real-time generation.
Recommended hardware: An RTX 3060 (12GB) or RTX 4090 (24GB) provides comfortable headroom. Inference itself does not need much VRAM, but the extra memory allows larger batch processing and faster generation. On an RTX 4090, short sentences generate in a few seconds, with the cloning step itself taking only a second or two (see the performance figures below).
CPU inference: Possible but slow. The model benefits significantly from GPU acceleration. Expect generation times of several seconds per sentence on CPU.
Setup options: The official GitHub repository is the main route for local deployment; installation amounts to cloning the repo, creating a Python environment, and installing the dependencies pinned in requirements.txt.

Performance expectations: Generation speed depends on input length and GPU. On an RTX 4090, expect 5-10 seconds for a 30-word sentence including the cloning step. The cloning itself (processing the reference audio) takes 1-2 seconds. Subsequent generations using the same voice are faster since the tone color is cached.
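Since extraction is the expensive one-time step, persisting the tone color embedding to disk avoids re-processing the same reference clip across runs. A minimal sketch around the `get_se` call used in the earlier examples; the helper name and cache path are illustrative.

```python
import os
import torch
from openvoice import se_extractor

def load_or_extract_se(reference_path, converter, cache_path, device="cpu"):
    """Reuse a cached tone color embedding when one exists for this reference."""
    if os.path.exists(cache_path):
        return torch.load(cache_path, map_location=device)
    target_se, _ = se_extractor.get_se(reference_path, converter, vad=True)
    torch.save(target_se, cache_path)  # the embedding is a small tensor; cheap to store
    return target_se
```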
Quantization: The model uses FP16 by default. INT8 quantization is possible but not officially supported—you would need to modify the inference code. Most users run it at FP16 on consumer GPUs without issues.
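If you do want to experiment with INT8, PyTorch's dynamic quantization of Linear layers is the usual starting point for this kind of modification. The snippet below is illustrative only: it assumes the converter exposes its underlying network as `converter.model` (an assumption about the code layout, not a documented attribute), and the effect on audio quality is untested.

```python
import torch

# Illustrative, unsupported: dynamically quantize Linear layers to INT8.
# `converter.model` is an assumed attribute; inspect the source before relying on it.
quantized = torch.ao.quantization.quantize_dynamic(
    converter.model, {torch.nn.Linear}, dtype=torch.qint8
)
```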
vs. Coqui TTS (XTTS-v2): XTTS-v2 is the most direct open-source competitor. It also supports voice cloning with cross-lingual generation. OpenVoice has better tone color accuracy and more granular style control. XTTS-v2 has a larger community and more pre-trained language support. Choose OpenVoice if you need precise timbre replication with style separation. Choose XTTS-v2 if you need broader language coverage out of the box.
vs. ElevenLabs: ElevenLabs delivers superior audio quality and simpler setup. It is a commercial API, not a local model. OpenVoice is free, open-source, and runs entirely on your hardware. If quality is your primary concern and you have budget, ElevenLabs wins. If you need local inference, no API costs, and control over the model, OpenVoice is the better choice.
vs. Meta’s Voicebox: Voicebox is research-only and not publicly available for local use. OpenVoice is the practical choice for anyone who wants to run voice cloning on their own machine today.