Coqui's multilingual zero-shot voice cloning TTS model supporting 17 languages and producing 24 kHz audio.
XTTS v2 is Coqui's multilingual text-to-speech model optimized for zero-shot voice cloning. Unlike traditional TTS systems that require hours of studio-quality recordings to clone a voice, XTTS v2 achieves convincing voice replication from a single 6-second audio sample. The model produces 24 kHz audio output and supports 17 languages.
Coqui built XTTS v2 as the successor to their first-generation cross-lingual TTS model. The parameter count is undisclosed, but the architecture is a dense transformer design. Based on inference performance and VRAM consumption, XTTS v2 likely falls in the 1-2 billion parameter range — comparable in compute requirements to models like Bark or Tortoise TTS, but with significantly faster inference and lower latency.
The model's primary advantage is cross-language voice cloning: you can clone a voice from English audio and generate speech in Japanese, Arabic, or any other supported language while preserving the original speaker's characteristics. This capability makes XTTS v2 one of the few practical options for multilingual voice synthesis on local hardware.
XTTS v2 uses a dense transformer architecture with an undisclosed parameter count. The model builds on the Tortoise TTS architecture with significant modifications to enable cross-lingual generation and faster inference. Key architectural changes from v1 include improved speaker conditioning, support for multiple speaker reference files, and interpolation between different voice embeddings.
The model operates as an autoregressive text-to-speech system. It processes input text through a text encoder, conditions the generation on a speaker embedding extracted from the reference audio, and outputs mel-spectrograms that are converted to 24 kHz waveforms using a separate vocoder. The architecture supports streaming inference with sub-200ms latency in optimized configurations.
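The streaming path can be sketched with the lower-level XTTS classes in the Coqui TTS library; the checkpoint paths below are placeholders, and the exact signatures may differ slightly between releases:

```python
# Sketch of streaming synthesis with the lower-level XTTS API (paths are placeholders).
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/path/to/xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts_v2/", eval=True)
model.cuda()  # streaming latency figures assume GPU inference

# Extract the speaker conditioning latents once from the reference clip,
# then reuse them for any amount of text.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference_speaker.wav"]
)

# inference_stream yields audio chunks as they are generated, which is what
# enables low time-to-first-audio in optimized configurations.
chunks = model.inference_stream(
    "Streaming lets playback start before the full passage is synthesized.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
wav = torch.cat([chunk for chunk in chunks], dim=0)
torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
```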
For local deployment, the model requires approximately 2-3 GB of VRAM at FP16 precision, which makes it feasible on most consumer GPUs from the last 3-4 years. Coqui does not publish a maximum input length, but practical testing shows the model handles paragraphs of 200-300 words reliably before quality degradation becomes noticeable.
XTTS v2's core capability is zero-shot voice cloning from short audio samples. The model requires only 6 seconds of reference audio to generate a usable voice clone, though longer samples (15-30 seconds) produce more consistent results. The cloned voice can then generate speech in any of the 17 supported languages.
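A minimal cloning call with the high-level Coqui TTS API looks roughly like the following; the reference file name is a placeholder, and depending on the library version the model is moved to the GPU either with gpu=True or with .to("cuda"):

```python
# Minimal zero-shot voice cloning with the high-level Coqui TTS API.
# "reference_speaker.wav" stands in for a ~6-30 second clip of the target voice.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

tts.tts_to_file(
    text="This sentence is spoken in the cloned voice.",
    speaker_wav="reference_speaker.wav",
    language="en",
    file_path="cloned_output.wav",
)
```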
Supported languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Mandarin), Japanese, Hungarian, Korean, and Hindi.
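As a sketch, the same reference clip can drive output in several of these languages simply by switching the language code (Coqui uses short codes such as "ja", "ar", and "zh-cn" for Mandarin); the sample texts and file names below are placeholders:

```python
# Cross-language cloning: one English reference clip, output in several languages.
# Check the loaded model's language list if a code is rejected.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

texts = {
    "ja": "これは日本語で生成された音声のサンプルです。",
    "ar": "هذا مثال على كلام مولد باللغة العربية.",
    "de": "Dies ist ein Beispiel für synthetisierte deutsche Sprache.",
}

for code, text in texts.items():
    tts.tts_to_file(
        text=text,
        speaker_wav="english_reference.wav",  # reference language need not match output language
        language=code,
        file_path=f"clone_{code}.wav",
    )
```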
Concrete use cases:
The model also supports emotion and style transfer through the cloning process — the emotional tone of the reference audio influences the generated speech. This is not controllable through explicit parameters but emerges from the speaker embedding.
XTTS v2 is practical to run on consumer hardware. The model is available through the Coqui TTS library and can be deployed via the TTS Python package or through community-maintained integrations.
Minimum hardware requirements:
Recommended hardware:
Realistic consumer GPUs that run XTTS v2 well:
Performance expectations:
Recommended quantization: Q4_K_M balances quality and VRAM efficiency. The quality difference between Q4_K_M and FP16 is minimal for speech generation, and the VRAM savings (approximately 2 GB vs 3.5 GB) make the model accessible on more hardware.
Getting started: The quickest path is installing the Coqui TTS package (pip install TTS) and loading the model with tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2"). For containerized deployment, community Docker images are available that bundle the model with optimized inference pipelines.
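A compact version of that quick start, assuming a recent TTS release where the model object is moved to a device with .to() (older releases used a gpu=True constructor argument) and exposes its language list through a languages attribute:

```python
# Quick start: `pip install TTS`, then load the model once.
# The first load downloads the weights (several GB) into the local model cache.
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

print(tts.languages)  # the 17 supported language codes
```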
XTTS v2 vs Bark (Suno): Bark produces more expressive speech with better prosody and can generate non-speech sounds, but it's significantly slower — Bark takes 10-15 seconds per second of audio on consumer GPUs. XTTS v2 is faster by an order of magnitude and produces cleaner audio at 24 kHz vs Bark's 16 kHz. Bark's voice cloning quality is comparable for English but degrades more on cross-language tasks. Choose Bark if you need sound effects or musical expression; choose XTTS v2 for practical multilingual voice cloning.
XTTS v2 vs Piper TTS: Piper is dramatically lighter — it runs on CPU with minimal RAM and achieves real-time or faster synthesis. Piper's voice quality is lower, and it lacks voice cloning entirely. Piper supports 20+ languages but requires separate models per language and per voice. XTTS v2 is the right choice when voice cloning and cross-language generation matter; Piper wins when you need minimal resource usage and don't need custom voices.
XTTS v2 vs ElevenLabs (cloud): ElevenLabs offers superior audio quality and more natural prosody, but it requires internet access and charges per character. XTTS v2 runs entirely offline with no usage costs. For production systems processing high volumes of speech, XTTS v2 at Q4 quantization on a dedicated GPU becomes more cost-effective than API usage within months. The tradeoff is quality — ElevenLabs still leads in naturalness, but XTTS v2 is closing the gap with each iteration.