Versatile instant voice-cloning TTS that replicates a reference speaker's tone color from a short audio clip and supports cross-lingual generation.
OpenVoice is an instant voice cloning text-to-speech (TTS) model developed by MyShell AI in collaboration with MIT and Tsinghua University. It replicates a reference speaker’s tone color from a short audio clip—typically 30 seconds or less—and generates speech in multiple languages while preserving the cloned voice. The model is released under the MIT license, making it free for commercial use.
What sets OpenVoice apart from other TTS models is its separation of tone color from voice style. Most voice cloning systems treat voice as a monolithic attribute. OpenVoice decouples them, giving you granular control over emotion, accent, rhythm, pauses, and intonation independently of the cloned voice. This means you can clone a speaker’s voice and make it sound happy, sad, or accented without losing the original timbre.
The architecture is dense, with an undisclosed parameter count; MyShell AI has not published the figure, which is common for commercial TTS models. In terms of modality, the model takes text plus a short reference audio clip as input and outputs synthesized speech. A maximum input length is not specified, but in practice the model handles sentence-to-paragraph-length inputs without issue.
OpenVoice has been powering MyShell’s instant voice cloning since May 2023, processing tens of millions of voice cloning requests. V2, released in April 2024, improves audio quality and adds native support for English, Spanish, French, Chinese, Japanese, and Korean.
OpenVoice uses a two-stage architecture that separates tone color cloning from style control. The first stage extracts the reference speaker’s tone color from the short audio sample. The second stage generates speech in the target language while applying the extracted tone color and any specified style parameters.
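To make the two stages concrete, here is a minimal sketch of the flow as it appears in the project's V1 demo notebooks. The checkpoint paths, class names, and keyword arguments are taken from that demo and should be treated as assumptions as far as newer releases are concerned; verify them against your checkout of the repository.

```python
# Sketch of the two-stage OpenVoice V1 pipeline (paths and APIs follow the
# demo notebooks in myshell-ai/OpenVoice; treat them as assumptions).
import torch
from openvoice import se_extractor
from openvoice.api import BaseSpeakerTTS, ToneColorConverter

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Stage 2 components: a base speaker TTS plus the tone color converter.
base_tts = BaseSpeakerTTS("checkpoints/base_speakers/EN/config.json", device=device)
base_tts.load_ckpt("checkpoints/base_speakers/EN/checkpoint.pth")
converter = ToneColorConverter("checkpoints/converter/config.json", device=device)
converter.load_ckpt("checkpoints/converter/checkpoint.pth")

# Stage 1: extract the reference speaker's tone color embedding once.
target_se, _ = se_extractor.get_se("reference.mp3", converter, vad=True)
source_se = torch.load("checkpoints/base_speakers/EN/en_default_se.pth").to(device)

# Stage 2: synthesize with the base speaker, then swap in the cloned tone color.
base_tts.tts("Hello from a cloned voice.", "tmp.wav", speaker="default", language="English")
converter.convert(audio_src_path="tmp.wav", src_se=source_se,
                  tgt_se=target_se, output_path="cloned.wav")
```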
This separation is computationally efficient. According to the authors, OpenVoice costs “tens of times less” than commercial APIs that deliver inferior performance. The model runs on a single GPU and does not require the massive compute resources typical of end-to-end neural TTS systems.
Because the parameter count is undisclosed, precise VRAM figures are not available from the provider. However, based on real-world usage and community reports, the model runs comfortably on consumer GPUs with 8GB or more VRAM. The V2 checkpoint is slightly larger than V1 due to improved training, but both versions run on the same hardware profile.
The model supports zero-shot cross-lingual voice cloning. This means the reference audio and the generated speech can be in different languages, and neither language needs to be in the training dataset. In practice, this works best with the natively supported languages in V2, but the architecture handles unseen languages with acceptable quality.
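For cross-lingual generation with V2, the base speaker comes from MeloTTS rather than the built-in V1 base speakers. The sketch below (English reference clip, Spanish output) follows the V2 demo's checkpoint layout; the paths, the `spk2id` lookup, and the `ses/es.pth` source embedding are assumptions drawn from those examples.

```python
# Cross-lingual sketch: clone tone color from an English reference,
# generate Spanish speech via a MeloTTS base speaker (V2 layout assumed).
import torch
from melo.api import TTS
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

device = "cuda:0" if torch.cuda.is_available() else "cpu"

converter_v2 = ToneColorConverter("checkpoints_v2/converter/config.json", device=device)
converter_v2.load_ckpt("checkpoints_v2/converter/checkpoint.pth")

# Tone color from an English reference clip...
target_se, _ = se_extractor.get_se("english_reference.mp3", converter_v2, vad=True)

# ...applied to a Spanish base speaker from MeloTTS.
melo = TTS(language="ES", device=device)
melo.tts_to_file("Hola, esto es una prueba de clonación de voz.",
                 melo.hps.data.spk2id["ES"], "tmp_es.wav", speed=1.0)

source_se = torch.load("checkpoints_v2/base_speakers/ses/es.pth", map_location=device)
converter_v2.convert(audio_src_path="tmp_es.wav", src_se=source_se,
                     tgt_se=target_se, output_path="cloned_es.wav")
```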
OpenVoice is purpose-built for instant voice cloning with style control.
The model excels at tone color cloning—it preserves the unique qualities of a speaker’s voice better than many alternatives. The style control is genuinely useful: you can generate the same sentence with the same voice but different emotions or accents, which is difficult with end-to-end models that bake style into the voice embedding.
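Because style lives in the base speaker rather than in the voice embedding, varying the delivery is just a different argument to the base TTS call. A sketch, reusing `base_tts`, `converter`, `target_se`, and `device` from the V1 example above; the style names and the `en_style_se.pth` embedding are taken from the V1 demo and may differ across releases.

```python
# Same text, same cloned voice, three different deliveries.
import torch

# Non-default styles use a separate style source embedding in the V1 demo.
style_se = torch.load("checkpoints/base_speakers/EN/en_style_se.pth").to(device)

for style in ("whispering", "cheerful", "excited"):  # style names per the V1 demo
    base_tts.tts("The quarterly numbers look great.", "tmp.wav",
                 speaker=style, language="English")
    converter.convert(audio_src_path="tmp.wav", src_se=style_se,
                      tgt_se=target_se, output_path=f"cloned_{style}.wav")
```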
Limitations: Audio quality, while improved in V2, does not match premium commercial services like ElevenLabs. The model requires some technical setup—it is not a plug-and-play product. Accent conversion can be inconsistent, particularly with less common language pairs.
OpenVoice runs on consumer hardware. Here is what you need to know for local deployment:
Minimum hardware: Any GPU with 6GB VRAM can run the V1 model at reasonable speed. V2 requires slightly more—8GB VRAM is the practical minimum for real-time or near-real-time generation.
Recommended hardware: An RTX 3060 (12GB) or RTX 4090 (24GB) provides comfortable headroom. Inference itself does not need much VRAM, but the extra memory allows larger batch processing and faster generation. On an RTX 4090, short sentences generate in a few seconds, with the cloning step itself taking only a second or two (see the performance figures below).
CPU inference: Possible but slow. The model benefits significantly from GPU acceleration. Expect generation times of several seconds per sentence on CPU.
Setup options: The official GitHub repository is the main route for local deployment; installation amounts to cloning the repo, creating a Python environment, and installing the dependencies pinned in requirements.txt.

Performance expectations: Generation speed depends on input length and GPU. On an RTX 4090, expect 5-10 seconds for a 30-word sentence including the cloning step. The cloning itself (processing the reference audio) takes 1-2 seconds. Subsequent generations using the same voice are faster since the tone color is cached.
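Since extraction is the expensive one-time step, persisting the tone color embedding to disk avoids re-processing the same reference clip across runs. A minimal sketch around the `get_se` call used in the earlier examples; the helper name and cache path are illustrative.

```python
import os
import torch
from openvoice import se_extractor

def load_or_extract_se(reference_path, converter, cache_path, device="cpu"):
    """Reuse a cached tone color embedding when one exists for this reference."""
    if os.path.exists(cache_path):
        return torch.load(cache_path, map_location=device)
    target_se, _ = se_extractor.get_se(reference_path, converter, vad=True)
    torch.save(target_se, cache_path)  # the embedding is a small tensor; cheap to store
    return target_se
```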
Quantization: The model uses FP16 by default. INT8 quantization is possible but not officially supported—you would need to modify the inference code. Most users run it at FP16 on consumer GPUs without issues.
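If you do want to experiment with INT8, PyTorch's dynamic quantization of Linear layers is the usual starting point for this kind of modification. The snippet below is illustrative only: it assumes the converter exposes its underlying network as `converter.model` (an assumption about the code layout, not a documented attribute), and the effect on audio quality is untested.

```python
import torch

# Illustrative, unsupported: dynamically quantize Linear layers to INT8.
# `converter.model` is an assumed attribute; inspect the source before relying on it.
quantized = torch.ao.quantization.quantize_dynamic(
    converter.model, {torch.nn.Linear}, dtype=torch.qint8
)
```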
vs. Coqui TTS (XTTS-v2): XTTS-v2 is the most direct open-source competitor. It also supports voice cloning with cross-lingual generation. OpenVoice has better tone color accuracy and more granular style control. XTTS-v2 has a larger community and more pre-trained language support. Choose OpenVoice if you need precise timbre replication with style separation. Choose XTTS-v2 if you need broader language coverage out of the box.
vs. ElevenLabs: ElevenLabs delivers superior audio quality and simpler setup. It is a commercial API, not a local model. OpenVoice is free, open-source, and runs entirely on your hardware. If quality is your primary concern and you have budget, ElevenLabs wins. If you need local inference, no API costs, and control over the model, OpenVoice is the better choice.
vs. Meta’s Voicebox: Voicebox is research-only and not publicly available for local use. OpenVoice is the practical choice for anyone who wants to run voice cloning on their own machine today.