Coqui's multilingual zero-shot voice cloning TTS model supporting 17 languages and producing 24 kHz audio.
XTTS v2 is Coqui's multilingual text-to-speech model optimized for zero-shot voice cloning. Unlike traditional TTS systems that require hours of studio-quality recordings to clone a voice, XTTS v2 achieves convincing voice replication from a single 6-second audio sample. The model produces 24 kHz audio output and supports 17 languages.
Coqui built XTTS v2 as the successor to their first-generation cross-lingual TTS model. The parameter count is undisclosed, but the architecture is a dense transformer design. Based on inference performance and VRAM consumption, XTTS v2 likely falls in the 1-2 billion parameter range — comparable in compute requirements to models like Bark or Tortoise TTS, but with significantly faster inference and lower latency.
The model's primary advantage is cross-language voice cloning: you can clone a voice from English audio and generate speech in Japanese, Arabic, or any other supported language while preserving the original speaker's characteristics. This capability makes XTTS v2 one of the few practical options for multilingual voice synthesis on local hardware.
XTTS v2 uses a dense transformer architecture with an undisclosed parameter count. The model builds on the Tortoise TTS architecture with significant modifications to enable cross-lingual generation and faster inference. Key architectural changes from v1 include improved speaker conditioning, support for multiple speaker reference files, and interpolation between different voice embeddings.
The model operates as an autoregressive text-to-speech system. It processes input text through a text encoder, conditions the generation on a speaker embedding extracted from the reference audio, and outputs mel-spectrograms that are converted to 24 kHz waveforms using a separate vocoder. The architecture supports streaming inference with sub-200ms latency in optimized configurations.
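The streaming path can be sketched with the lower-level XTTS classes in the Coqui TTS library; the checkpoint paths below are placeholders, and the exact signatures may differ slightly between releases:

```python
# Sketch of streaming synthesis with the lower-level XTTS API (paths are placeholders).
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

config = XttsConfig()
config.load_json("/path/to/xtts_v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/path/to/xtts_v2/", eval=True)
model.cuda()  # streaming latency figures assume GPU inference

# Extract the speaker conditioning latents once from the reference clip,
# then reuse them for any amount of text.
gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
    audio_path=["reference_speaker.wav"]
)

# inference_stream yields audio chunks as they are generated, which is what
# enables low time-to-first-audio in optimized configurations.
chunks = model.inference_stream(
    "Streaming lets playback start before the full passage is synthesized.",
    "en",
    gpt_cond_latent,
    speaker_embedding,
)
wav = torch.cat([chunk for chunk in chunks], dim=0)
torchaudio.save("xtts_streaming.wav", wav.squeeze().unsqueeze(0).cpu(), 24000)
```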
For local deployment, the model requires approximately 2-3 GB of VRAM at FP16 precision, which makes it feasible on most consumer GPUs from the last 3-4 years. Coqui does not publish a maximum input length, but practical testing shows the model handles paragraphs of 200-300 words reliably before quality degradation becomes noticeable.
XTTS v2's core capability is zero-shot voice cloning from short audio samples. The model requires only 6 seconds of reference audio to generate a usable voice clone, though longer samples (15-30 seconds) produce more consistent results. The cloned voice can then generate speech in any of the 17 supported languages.
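A minimal cloning call with the high-level Coqui TTS API looks roughly like the following; the reference file name is a placeholder, and depending on the library version the model is moved to the GPU either with gpu=True or with .to("cuda"):

```python
# Minimal zero-shot voice cloning with the high-level Coqui TTS API.
# "reference_speaker.wav" stands in for a ~6-30 second clip of the target voice.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

tts.tts_to_file(
    text="This sentence is spoken in the cloned voice.",
    speaker_wav="reference_speaker.wav",
    language="en",
    file_path="cloned_output.wav",
)
```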
Supported languages: English, Spanish, French, German, Italian, Portuguese, Polish, Turkish, Russian, Dutch, Czech, Arabic, Chinese (Mandarin), Japanese, Hungarian, Korean, and Hindi.
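As a sketch, the same reference clip can drive output in several of these languages simply by switching the language code (Coqui uses short codes such as "ja", "ar", and "zh-cn" for Mandarin); the sample texts and file names below are placeholders:

```python
# Cross-language cloning: one English reference clip, output in several languages.
# Check the loaded model's language list if a code is rejected.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2", gpu=True)

texts = {
    "ja": "これは日本語で生成された音声のサンプルです。",
    "ar": "هذا مثال على كلام مولد باللغة العربية.",
    "de": "Dies ist ein Beispiel für synthetisierte deutsche Sprache.",
}

for code, text in texts.items():
    tts.tts_to_file(
        text=text,
        speaker_wav="english_reference.wav",  # reference language need not match output language
        language=code,
        file_path=f"clone_{code}.wav",
    )
```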
Concrete use cases:
The model also supports emotion and style transfer through the cloning process — the emotional tone of the reference audio influences the generated speech. This is not controllable through explicit parameters but emerges from the speaker embedding.
XTTS v2 is practical to run on consumer hardware. The model is available through the Coqui TTS library and can be deployed via the TTS Python package or through community-maintained integrations.
Minimum hardware requirements:
Recommended hardware:
Realistic consumer GPUs that run XTTS v2 well:
Performance expectations:
Recommended quantization: Q4_K_M balances quality and VRAM efficiency. The quality difference between Q4_K_M and FP16 is minimal for speech generation, and the VRAM savings (approximately 2 GB vs 3.5 GB) make the model accessible on more hardware.
Getting started: The quickest path is installing the Coqui TTS package (pip install TTS) and loading the model with tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2"). For containerized deployment, community Docker images are available that bundle the model with optimized inference pipelines.
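A compact version of that quick start, assuming a recent TTS release where the model object is moved to a device with .to() (older releases used a gpu=True constructor argument) and exposes its language list through a languages attribute:

```python
# Quick start: `pip install TTS`, then load the model once.
# The first load downloads the weights (several GB) into the local model cache.
import torch
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

print(tts.languages)  # the 17 supported language codes
```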
XTTS v2 vs Bark (Suno): Bark produces more expressive speech with better prosody and can generate non-speech sounds, but it's significantly slower — Bark takes 10-15 seconds per second of audio on consumer GPUs. XTTS v2 is faster by an order of magnitude and produces cleaner audio at 24 kHz vs Bark's 16 kHz. Bark's voice cloning quality is comparable for English but degrades more on cross-language tasks. Choose Bark if you need sound effects or musical expression; choose XTTS v2 for practical multilingual voice cloning.
XTTS v2 vs Piper TTS: Piper is dramatically lighter — it runs on CPU with minimal RAM and achieves real-time or faster synthesis. Piper's voice quality is lower, and it lacks voice cloning entirely. Piper supports 20+ languages but requires separate models per language and per voice. XTTS v2 is the right choice when voice cloning and cross-language generation matter; Piper wins when you need minimal resource usage and don't need custom voices.
XTTS v2 vs ElevenLabs (cloud): ElevenLabs offers superior audio quality and more natural prosody, but it requires internet access and charges per character. XTTS v2 runs entirely offline with no usage costs. For production systems processing high volumes of speech, XTTS v2 at Q4 quantization on a dedicated GPU becomes more cost-effective than API usage within months. The tradeoff is quality — ElevenLabs still leads in naturalness, but XTTS v2 is closing the gap with each iteration.