MyShell.ai's open-source instant voice cloning model with multilingual base speakers and a tone-color converter.
OpenVoice V2 is an open-source instant voice cloning model developed by MyShell AI and MIT. Released in April 2024 under the MIT License, it enables zero-shot voice cloning from a short audio reference — no fine-tuning or training required. The model uses a dense architecture with an undisclosed parameter count, making it a practical option for developers who want to run voice cloning entirely on their own hardware without cloud dependencies.
What sets OpenVoice V2 apart from other TTS models is its separation of tone color and style control. Most voice cloning systems tie the speaker's voice characteristics to the delivery style. OpenVoice V2 decouples these two dimensions, allowing you to clone a specific voice while independently controlling emotion, accent, rhythm, and intonation. This makes it useful for applications where you need a consistent voice identity across varied speaking contexts.
The model natively supports six languages: English, Spanish, French, Chinese, Japanese, and Korean. It can clone voices from any input language and generate speech in any supported language, regardless of whether the reference audio matches the output language. This cross-lingual capability works in zero-shot mode — the model generalizes to language pairs it may not have explicitly seen during training.
OpenVoice V2 is a text-to-speech model built on a dense architecture. The parameter count is undisclosed, but the model is designed to run on consumer hardware. It uses a two-stage pipeline: a base speaker model generates speech with the target language and prosody, then a tone color converter maps the voice characteristics extracted from a reference audio sample onto that output. This separation is what enables independent control over voice identity and style.
The tone color converter operates on mel-spectrogram representations. It captures the timbral qualities of the reference speaker — pitch range, resonance, vocal tract characteristics — and maps them onto the base speaker's output. The base speaker provides the language-specific phonetics and prosody for each of the six supported languages.
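Here is a minimal sketch of that two-stage flow, based on the V2 demo notebook in the official repository. The checkpoint paths and `reference.wav` are illustrative assumptions that may differ in your checkout, and the V2 base speakers come from the separate MeloTTS package:

```python
import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
from melo.api import TTS  # MeloTTS supplies the V2 base speakers

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: extract the reference speaker's tone color embedding.
converter = ToneColorConverter("checkpoints_v2/converter/config.json", device=device)
converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")
target_se, _ = se_extractor.get_se("reference.wav", converter, vad=False)

# Stage 2: a base speaker renders the text with language-specific
# phonetics and prosody. Cross-lingual: the reference above can be
# English even though the output here is Spanish.
tts = TTS(language="ES", device=device)
speaker_id = next(iter(tts.hps.data.spk2id.values()))
tts.tts_to_file("Hola, ¿qué tal?", speaker_id, "base_output.wav", speed=1.0)

# The converter then swaps the base speaker's timbre for the
# reference speaker's, leaving content and prosody intact.
source_se = torch.load("checkpoints_v2/base_speakers/ses/es.pth",
                       map_location=device)
converter.convert(audio_src_path="base_output.wav", src_se=source_se,
                  tgt_se=target_se, output_path="cloned.wav")
```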
Because parameters are undisclosed, you cannot estimate VRAM requirements from parameter count alone. However, the model's practical behavior on consumer hardware is well-documented by the community. The model does not specify a context length, which is typical for TTS models — inference is driven by input text length and audio duration rather than token context windows.
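Since the footprint cannot be derived from a parameter count, the practical approach is to measure peak allocation empirically. This generic PyTorch sketch wraps any inference call; `run_inference` is a placeholder for a zero-argument callable that executes the full pipeline once:

```python
import torch

def peak_vram_gib(run_inference) -> float:
    """Run one inference pass and report peak CUDA memory in GiB.

    `run_inference` is a placeholder for your own callable that
    executes the complete OpenVoice pipeline a single time.
    """
    torch.cuda.reset_peak_memory_stats()
    run_inference()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024**3
```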
OpenVoice V2's core capability is accurate tone color cloning with fine-grained style control. In practice, that makes it a fit for any application where a single voice identity must be maintained across varied languages, emotions, and speaking contexts.
The model does not support real-time streaming out of the box. It processes text input and generates audio output as a complete inference pass. For production deployments, you would need to handle streaming at the application layer.
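A common application-layer workaround is pseudo-streaming: split the input text into sentences, synthesize each as a complete pass, and hand finished chunks to the player while later ones are still rendering. A sketch, where `synthesize` is a stand-in for one full OpenVoice generation pass:

```python
import re
from typing import Callable, Iterator

def stream_sentences(text: str,
                     synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield audio chunk-by-chunk so playback can start before the
    full text has been rendered. `synthesize` is a placeholder for
    a function that runs the complete pipeline on one sentence."""
    # Naive sentence splitter; swap in a real tokenizer for production.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    for sentence in filter(None, sentences):
        yield synthesize(sentence)

# Usage: feed chunks to an audio sink as they arrive.
# for chunk in stream_sentences(long_text, synthesize):
#     playback_queue.put(chunk)
```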
OpenVoice V2 runs on consumer hardware, but you need to understand the resource requirements to avoid surprises. The model is not quantized by default, and community quantization support varies.
Inference speed depends on output audio length and hardware. On an RTX 4090, expect approximately 2-4 seconds of audio generated per second of wall-clock time for short clips (under 30 seconds). On an RTX 3060 with 12 GB, expect roughly 0.5-1 second of audio per second of processing time. These figures vary significantly based on audio sample rate and output length.
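These throughput figures are real-time factors: seconds of audio produced per second of wall-clock time. You can verify them on your own hardware with a simple timing wrapper; `generate_wav` here is a placeholder for one full pipeline pass that returns the output file path:

```python
import time
import wave

def real_time_factor(generate_wav, text: str) -> float:
    """Return seconds of audio generated per second of wall-clock
    time. `generate_wav` is a placeholder: it takes text and returns
    the path of the synthesized WAV file."""
    start = time.perf_counter()
    out_path = generate_wav(text)
    elapsed = time.perf_counter() - start
    with wave.open(out_path, "rb") as wav:
        audio_seconds = wav.getnframes() / wav.getframerate()
    return audio_seconds / elapsed
```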
OpenVoice V2 does not have widely standardized quantization formats like GGUF or GPTQ. The model ships as PyTorch checkpoints. You can run the weights in FP16 or apply INT8 quantization using standard PyTorch tooling, but community-supported quantized versions are limited. For most users, running the model in FP16 on a GPU with 12 GB VRAM is the most reliable approach.
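A sketch of both options using standard PyTorch precision tooling, assuming you have a handle on the underlying `torch.nn.Module` (the wrapper classes in the repository differ, so treat `model` as an assumption). Dynamic INT8 quantization targets CPU inference, and its effect on audio quality for this model is untested, so validate the output by ear:

```python
import torch
import torch.nn as nn

def to_fp16(model: nn.Module) -> nn.Module:
    """Half-precision weights roughly halve GPU memory use."""
    return model.half().to("cuda")

def to_int8_cpu(model: nn.Module) -> nn.Module:
    """Dynamic INT8 quantization of linear layers, for CPU inference.
    Audio-quality impact is untested here; listen before shipping."""
    return torch.ao.quantization.quantize_dynamic(
        model.cpu(), {nn.Linear}, dtype=torch.qint8
    )
```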
The quickest way to run OpenVoice V2 locally is to clone the [official GitHub repository](https://github.com/myshell-ai/OpenVoice) and follow the Linux installation instructions. The model requires Python 3.9, PyTorch, and the dependencies listed in `requirements.txt`. The repository includes Jupyter notebooks (`demo_part1.ipynb`, `demo_part2.ipynb`, `demo_part3.ipynb`) that walk through voice cloning, style control, and cross-lingual generation.
For non-Linux platforms, community installation guides exist but are not officially maintained. Windows users will need WSL2 or a Docker container. macOS users on Apple Silicon can run the model using PyTorch's MPS backend, though performance is lower than with CUDA.
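Device selection is the main cross-platform difference. A small helper covering CUDA, Apple's MPS backend, and the CPU fallback:

```python
import torch

def pick_device() -> str:
    """Prefer CUDA, then Apple Silicon's MPS backend, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"

# Pass the result wherever the pipeline accepts a `device` argument,
# e.g. ToneColorConverter(config_path, device=pick_device()).
```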
If you are constrained on VRAM, run the pipeline in FP16 as described above, or fall back to CPU inference and accept substantially longer generation times.
OpenVoice V2 vs. ElevenLabs: ElevenLabs offers higher audio quality and more polished output, but it is a cloud-only service with usage-based pricing. OpenVoice V2 runs entirely offline, costs nothing per inference, and gives you full control over the pipeline. If you need production-grade audio quality and have budget for API costs, ElevenLabs is the better choice. If you need local inference, privacy, or unlimited usage, OpenVoice V2 wins.
OpenVoice V2 vs. Coqui TTS: Coqui TTS provides similar voice cloning capabilities with a broader range of pretrained models. Coqui has more active community development and better quantization support. OpenVoice V2's advantage is its native multilingual support with consistent quality across six languages, whereas Coqui's multilingual models vary in quality depending on language. OpenVoice V2 also offers more granular style control through its tone-color separation architecture.
Choose OpenVoice V2 when you need cross-lingual voice cloning with independent style control, and you want to run everything on your own hardware under the MIT License.