MyShell.ai's open-source instant voice cloning model with multilingual base speakers and a tone-color converter.
OpenVoice V2 is an open-source instant voice cloning model developed by MyShell AI and MIT. Released in April 2024 under the MIT License, it enables zero-shot voice cloning from a short audio reference — no fine-tuning or training required. The model uses a dense architecture with an undisclosed parameter count, making it a practical option for developers who want to run voice cloning entirely on their own hardware without cloud dependencies.
What sets OpenVoice V2 apart from other TTS models is its separation of tone color and style control. Most voice cloning systems tie the speaker's voice characteristics to the delivery style. OpenVoice V2 decouples these two dimensions, allowing you to clone a specific voice while independently controlling emotion, accent, rhythm, and intonation. This makes it useful for applications where you need a consistent voice identity across varied speaking contexts.
The model natively supports six languages: English, Spanish, French, Chinese, Japanese, and Korean. It can clone voices from any input language and generate speech in any supported language, regardless of whether the reference audio matches the output language. This cross-lingual capability works in zero-shot mode — the model generalizes to language pairs it may not have explicitly seen during training.
OpenVoice V2 is a text-to-speech model built on a dense architecture. The parameter count is undisclosed, but the model is designed to run on consumer hardware. It uses a two-stage pipeline: a base speaker model generates speech with the target language and prosody, then a tone color converter maps the voice characteristics extracted from a reference audio sample onto that output. This separation is what enables independent control over voice identity and style.
The tone color converter operates on mel-spectrogram representations. It captures the timbral qualities of the reference speaker — pitch range, resonance, vocal tract characteristics — and maps them onto the base speaker's output. The base speaker provides the language-specific phonetics and prosody for each of the six supported languages.
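Here is a minimal sketch of that two-stage flow, based on the V2 demo notebook in the official repository. The checkpoint paths and `reference.wav` are illustrative assumptions that may differ in your checkout, and the V2 base speakers come from the separate MeloTTS package:

```python
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
from melo.api import TTS  # MeloTTS supplies the V2 base speakers

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: extract the reference speaker's tone color embedding.
converter = ToneColorConverter("checkpoints_v2/converter/config.json", device=device)
converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")
target_se, _ = se_extractor.get_se("reference.wav", converter, vad=False)

# Stage 2: a base speaker renders the text with language-specific
# phonetics and prosody. Cross-lingual: the reference above can be
# English even though the output here is Spanish.
tts = TTS(language="ES", device=device)
speaker_id = next(iter(tts.hps.data.spk2id.values()))
tts.tts_to_file("Hola, ¿qué tal?", speaker_id, "base_output.wav", speed=1.0)

# The converter then swaps the base speaker's timbre for the
# reference speaker's, leaving content and prosody intact.
source_se = torch.load("checkpoints_v2/base_speakers/ses/es.pth",
                       map_location=device)
converter.convert(audio_src_path="base_output.wav", src_se=source_se,
                  tgt_se=target_se, output_path="cloned.wav")
```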
Because parameters are undisclosed, you cannot estimate VRAM requirements from parameter count alone. However, the model's practical behavior on consumer hardware is well-documented by the community. The model does not specify a context length, which is typical for TTS models — inference is driven by input text length and audio duration rather than token context windows.
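Since the footprint cannot be derived from a parameter count, the practical approach is to measure peak allocation empirically. This generic PyTorch sketch wraps any inference call; `run_inference` is a placeholder for a zero-argument callable that executes the full pipeline once:

```python
import torch

def peak_vram_gib(run_inference) -> float:
    """Run one inference pass and report peak CUDA memory in GiB.

    `run_inference` is a placeholder for your own callable that
    executes the complete OpenVoice pipeline a single time.
    """
    torch.cuda.reset_peak_memory_stats()
    run_inference()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3
```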
OpenVoice V2's core capability is accurate tone color cloning with fine-grained style control. In practice, that makes it a fit for any application where a single voice identity must be maintained across varied languages, emotions, and speaking contexts.
The model does not support real-time streaming out of the box. It processes text input and generates audio output as a complete inference pass. For production deployments, you would need to handle streaming at the application layer.
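A common application-layer workaround is pseudo-streaming: split the input text into sentences, synthesize each as a complete pass, and hand finished chunks to the player while later ones are still rendering. A sketch, where `synthesize` is a stand-in for one full OpenVoice generation pass:

```python
import re
from typing import Callable, Iterator

def stream_sentences(text: str,
                     synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield audio chunk-by-chunk so playback can start before the
    full text has been rendered. `synthesize` is a placeholder for
    a function that runs the complete pipeline on one sentence."""
    # Naive sentence splitter; swap in a real tokenizer for production.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in filter(None, sentences):
        yield synthesize(sentence)

# Usage: feed chunks to an audio sink as they arrive.
# for chunk in stream_sentences(long_text, synthesize):
#     playback_queue.put(chunk)
```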
OpenVoice V2 runs on consumer hardware, but you need to understand the resource requirements to avoid surprises. The model is not quantized by default, and community quantization support varies.
Inference speed depends on output audio length and hardware. On an RTX 4090, expect approximately 2-4 seconds of audio generated per second of wall-clock time for short clips (under 30 seconds). On an RTX 3060 with 12 GB, expect roughly 0.5-1 second of audio per second of processing time. These figures vary significantly based on audio sample rate and output length.
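These throughput figures are real-time factors: seconds of audio produced per second of wall-clock time. You can verify them on your own hardware with a simple timing wrapper; `generate_wav` here is a placeholder for one full pipeline pass that returns the output file path:

```python
import time
import wave

def real_time_factor(generate_wav, text: str) -> float:
    """Return seconds of audio generated per second of wall-clock
    time. `generate_wav` is a placeholder: it takes text and returns
    the path of the synthesized WAV file."""
    start = time.perf_counter()
    out_path = generate_wav(text)
    elapsed = time.perf_counter() - start
    with wave.open(out_path, "rb") as wav:
        audio_seconds = wav.getnframes() / wav.getframerate()
    return audio_seconds / elapsed
```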
OpenVoice V2 does not have widely standardized quantization formats like GGUF or GPTQ. The model ships as PyTorch checkpoints. You can run the weights in FP16 or apply INT8 quantization using standard PyTorch tooling, but community-supported quantized versions are limited. For most users, running the model in FP16 on a GPU with 12 GB VRAM is the most reliable approach.
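A sketch of both options using standard PyTorch precision tooling, assuming you have a handle on the underlying `torch.nn.Module` (the wrapper classes in the repository differ, so treat `model` as an assumption). Dynamic INT8 quantization targets CPU inference, and its effect on audio quality for this model is untested, so validate the output by ear:

```python
import torch
import torch.nn as nn

def to_fp16(model: nn.Module) -> nn.Module:
    """Half-precision weights roughly halve GPU memory use."""
    return model.half().to("cuda")

def to_int8_cpu(model: nn.Module) -> nn.Module:
    """Dynamic INT8 quantization of linear layers, for CPU inference.
    Audio-quality impact is untested here; listen before shipping."""
    return torch.ao.quantization.quantize_dynamic(
        model.cpu(), {nn.Linear}, dtype=torch.qint8
    )
```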
The quickest way to run OpenVoice V2 locally is to clone the [official GitHub repository](https://github.com/myshell-ai/OpenVoice) and follow the Linux installation instructions. The model requires Python 3.9, PyTorch, and the dependencies listed in `requirements.txt`. The repository includes Jupyter notebooks (`demo_part1.ipynb`, `demo_part2.ipynb`, `demo_part3.ipynb`) that walk through voice cloning, style control, and cross-lingual generation.
For non-Linux platforms, community installation guides exist but are not officially maintained. Windows users will need WSL2 or a Docker container. macOS users on Apple Silicon can run the model using PyTorch's MPS backend, though performance is lower than with CUDA.
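Device selection is the main cross-platform difference. A small helper covering CUDA, Apple's MPS backend, and the CPU fallback:

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then Apple Silicon's MPS backend, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

# Pass the result wherever the pipeline accepts a `device` argument,
# e.g. ToneColorConverter(config_path, device=pick_device()).
```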
If you are constrained on VRAM, run the pipeline in FP16 as described above, or fall back to CPU inference and accept substantially longer generation times.
OpenVoice V2 vs. ElevenLabs: ElevenLabs offers higher audio quality and more polished output, but it is a cloud-only service with usage-based pricing. OpenVoice V2 runs entirely offline, costs nothing per inference, and gives you full control over the pipeline. If you need production-grade audio quality and have budget for API costs, ElevenLabs is the better choice. If you need local inference, privacy, or unlimited usage, OpenVoice V2 wins.
OpenVoice V2 vs. Coqui TTS: Coqui TTS provides similar voice cloning capabilities with a broader range of pretrained models. Coqui has more active community development and better quantization support. OpenVoice V2's advantage is its native multilingual support with consistent quality across six languages, whereas Coqui's multilingual models vary in quality depending on language. OpenVoice V2 also offers more granular style control through its tone-color separation architecture.
Choose OpenVoice V2 when you need cross-lingual voice cloning with independent style control, and you want to run everything on your own hardware under the MIT License.