Human-level multi-speaker TTS that uses style diffusion and adversarial training with a large pre-trained speech language model.
StyleTTS 2 is a text-to-speech model developed by Yinghao Aaron Li and colleagues at Columbia University, released under the MIT license. It represents a significant advance in local TTS synthesis, achieving what the authors describe as human-level performance on both single-speaker and multi-speaker benchmarks. The model uses a dense architecture with an undisclosed parameter count, and its sole modality is text-to-speech generation.
What sets StyleTTS 2 apart from earlier TTS models is its approach to style generation. Unlike models that require a reference audio sample to clone a speaking style, StyleTTS 2 infers the appropriate prosody, intonation, and rhythm directly from the input text. This makes it practical for applications where you don't have a reference speaker to emulate. The model uses style diffusion—a latent diffusion process that generates a style embedding conditioned on the text—combined with adversarial training using large pre-trained speech language models (SLMs) like WavLM as discriminators.
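To make the mechanism concrete, here is a toy sketch of text-conditioned style sampling using a plain DDPM-style reverse process in PyTorch. The network, dimensions, and noise schedule below are illustrative assumptions, not the paper's formulation or the repository's code; the point is only that a fixed-size style vector is drawn from noise, guided by text features.

```python
import torch
import torch.nn as nn

class StyleDenoiser(nn.Module):
    """Toy denoiser: predicts the noise in a style vector given text features and a timestep.

    Stand-in for the conditional network used by style diffusion; dimensions are illustrative.
    """
    def __init__(self, style_dim=128, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + text_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, style_dim),
        )

    def forward(self, noisy_style, text_emb, t):
        # t is a scalar timestep in [0, 1], broadcast across the batch
        t_feat = t.expand(noisy_style.shape[0], 1)
        return self.net(torch.cat([noisy_style, text_emb, t_feat], dim=-1))

@torch.no_grad()
def sample_style(denoiser, text_emb, steps=50, style_dim=128):
    """DDPM-like ancestral sampling: start from noise, iteratively denoise into a style vector."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    s = torch.randn(text_emb.shape[0], style_dim)  # pure noise
    for i in reversed(range(steps)):
        t = torch.tensor([[i / steps]])
        eps = denoiser(s, text_emb, t)  # predicted noise
        # standard DDPM posterior mean update
        s = (s - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            s = s + torch.sqrt(betas[i]) * torch.randn_like(s)
    return s  # style embedding that conditions the rest of the pipeline

# Usage: text_emb would come from the text encoder in a real system.
denoiser = StyleDenoiser()
style = sample_style(denoiser, text_emb=torch.randn(1, 512))
print(style.shape)  # torch.Size([1, 128])
```

In the real model, this sampled vector conditions the duration and prosody predictors and the decoder, which is why the output prosody varies naturally across runs without any reference audio.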
The model was published at NeurIPS 2023 and has been validated against human recordings. On the LJSpeech single-speaker dataset, StyleTTS 2 surpassed ground-truth recordings in naturalness as judged by native English speakers. On the VCTK multi-speaker dataset, it matched human recordings. These results place it in a category with systems like NaturalSpeech and VALL-E, but with the advantage of being publicly available with an open-source codebase.
StyleTTS 2's exact parameter count isn't published, but the model's memory footprint and inference speed suggest it operates in a range that is feasible on consumer hardware, unlike some large-scale TTS systems that require server-grade GPUs.
The architecture consists of several key components:
Text encoder: converts the input phoneme sequence into hidden representations.
Style diffusion module: samples a fixed-size style vector from noise, conditioned on the text, capturing prosody and speaking style.
Duration and prosody predictors: predict phoneme durations, pitch, and energy, conditioned on the sampled style vector.
Speech decoder: generates the waveform directly from the aligned phoneme features, predicted pitch and energy, and the style vector; the released code includes HiFi-GAN-based and iSTFTNet-based decoder variants.
SLM discriminator: a large pre-trained speech language model (WavLM) used adversarially during training to push outputs toward human-level naturalness; it is not needed at inference time.
The context length is not specified, but the model has been demonstrated on long-form narration tasks, suggesting it can handle paragraph-length inputs without degradation. The text-only modality means you feed it plain text and get audio output—no multi-modal inputs required.
StyleTTS 2 excels at generating natural-sounding speech from text across multiple scenarios:
Single-speaker synthesis: On the LJSpeech benchmark, the model produces speech that native English speakers rated as more natural than actual human recordings. This makes it suitable for audiobook narration, voice assistants, and content creation where consistent voice quality matters.
Multi-speaker synthesis: The VCTK results show the model can handle multiple speakers with naturalness matching human recordings. This is useful for dialogue systems, character voices in games, or any application requiring distinct speaker identities.
Zero-shot speaker adaptation: When trained on LibriTTS, StyleTTS 2 outperforms previous publicly available models for adapting to new speakers without fine-tuning. You can provide a short sample of a target speaker and generate speech in that voice from arbitrary text.
Long-form narration: The model handles extended text inputs, making it practical for generating podcasts, lecture recordings, or automated voiceovers for video content. In practice, long scripts are typically split into sentences and synthesized chunk by chunk, as sketched after this list.
Speech expressiveness: Because the style diffusion module generates appropriate prosody from text alone, the output includes natural variations in pitch, rhythm, and emphasis—avoiding the monotone quality of older TTS systems.
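As a sketch of the long-form workflow referenced above, the example below splits a script into sentences, synthesizes each one, and stitches the audio together with short pauses. The `synthesize` function is a hypothetical placeholder for whatever inference entry point you assemble from the StyleTTS 2 repository, and the 24 kHz sample rate is an assumption about the checkpoint's output rate.

```python
import re
import numpy as np
import soundfile as sf

SAMPLE_RATE = 24_000  # assumed output rate of the released checkpoints

def synthesize(sentence: str) -> np.ndarray:
    """Hypothetical stand-in for a StyleTTS 2 inference call returning a float32 waveform."""
    raise NotImplementedError("wire this up to the repository's inference code")

def narrate(script: str, out_path: str, pause_s: float = 0.35) -> None:
    # Split on sentence-ending punctuation; good enough for narration scripts.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    pause = np.zeros(int(pause_s * SAMPLE_RATE), dtype=np.float32)
    chunks = []
    for sentence in sentences:
        chunks.append(synthesize(sentence))
        chunks.append(pause)  # short silence between sentences
    sf.write(out_path, np.concatenate(chunks), SAMPLE_RATE)

# narrate(open("chapter1.txt").read(), "chapter1.wav")
```

Chunking keeps each synthesis call short and lets prosody reset naturally at sentence boundaries, which is usually preferable to feeding an entire chapter in one pass.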
StyleTTS 2 runs on consumer hardware, though requirements depend on whether you're running inference only or fine-tuning.
Minimum hardware for inference: A GPU with 4GB VRAM can run the model at reduced precision. This covers most modern GPUs including the RTX 3060, RTX 4060, and equivalent AMD cards. On an RTX 4090, you can expect real-time or faster synthesis—generating several seconds of audio per second of processing time.
Recommended hardware: For comfortable inference with headroom, 8GB VRAM is sufficient. An RTX 3070, RTX 4070, or M4 Max with adequate unified memory will handle the model without issues. The model does not require the memory bandwidth of large language models, so even mid-range GPUs perform well.
Quantization: The model benefits from FP16 inference, which halves memory usage compared to FP32 with negligible quality loss. INT8 quantization is possible but may introduce audible artifacts in the generated speech. For most users, FP16 is the sweet spot.
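Here is the mixed-precision pattern in PyTorch, shown on a dummy module that stands in for the StyleTTS 2 components (the repository has its own model-loading code); the same autocast wrapper applies once the real modules are on the GPU.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Dummy network standing in for the loaded StyleTTS 2 modules; the pattern is what matters.
net = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).to(device).eval()
features = torch.randn(1, 200, 512, device=device)  # e.g. phoneme-level features

with torch.inference_mode():
    # Autocast runs matmuls and convolutions in FP16 while keeping numerically
    # sensitive ops in FP32, roughly halving activation memory on CUDA.
    with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=(device == "cuda")):
        out = net(features)

print(out.dtype)  # torch.float16 on CUDA, torch.float32 when autocast is disabled
```

You can also call `.half()` on the whole model, but autocast is the safer default if any submodule is sensitive to half precision.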
Performance: Tokens per second is not the relevant metric here, since StyleTTS 2 generates audio waveforms directly; what matters is how many seconds of audio you get per second of wall-clock time. On an RTX 4090, you can expect to generate 5-10 seconds of audio per second of wall time for single-speaker synthesis. Multi-speaker and zero-shot modes are slightly slower but remain real-time capable.
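To check synthesis speed on your own hardware, a small timing helper like the one below works. `synth_fn` is any callable that maps text to a 1-D waveform (for example, an inference function built from the repository's demo code), and the 24 kHz sample rate is again an assumption.

```python
import time
import numpy as np

SAMPLE_RATE = 24_000  # assumed output rate of the released checkpoints

def measure_throughput(synth_fn, text: str, warmup: int = 1, runs: int = 5) -> float:
    """Return seconds of audio generated per second of wall time (>1 means faster than real time)."""
    for _ in range(warmup):
        synth_fn(text)  # warm up CUDA kernels and caches
    start = time.perf_counter()
    total_audio_s = 0.0
    for _ in range(runs):
        wav = synth_fn(text)  # expected: 1-D float waveform (NumPy array or tensor)
        total_audio_s += len(wav) / SAMPLE_RATE
    wall_s = time.perf_counter() - start
    return total_audio_s / wall_s

# Example, with a hypothetical inference function you have wired up:
# speedup = measure_throughput(my_styletts2_inference, "A sentence of typical narration length.")
# print(f"{speedup:.1f}x real time")
```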
Getting started: The official GitHub repository provides training and inference scripts. For quick local deployment, you can clone the repo, install the dependencies listed in requirements.txt, and run the provided demo scripts. The model weights are available through the repository, and there are Colab notebooks for testing without local setup.
VRAM requirements breakdown (approximate, based on the figures above):
Inference at FP16: around 4GB, enough for single-speaker and multi-speaker synthesis on cards like the RTX 3060 or RTX 4060.
Inference at FP32: roughly double the FP16 footprint; 8GB provides comfortable headroom.
Fine-tuning: noticeably higher and dependent on batch size and segment length; plan for more than the inference minimum.
StyleTTS 2 vs. VITS: VITS is a popular end-to-end TTS model that also uses adversarial training and variational inference. StyleTTS 2 improves on VITS in two key areas: style diversity (through the diffusion-based style module) and naturalness (through the SLM discriminator). In head-to-head comparisons on the LJSpeech dataset, StyleTTS 2 outperforms VITS by a significant margin in naturalness ratings. However, VITS has a larger ecosystem of pre-trained models and community support, making it easier to get started for some use cases.
StyleTTS 2 vs. NaturalSpeech: NaturalSpeech (from Microsoft) achieves comparable quality but is not publicly available with an open-source license. StyleTTS 2 offers similar or better performance with full access to the model weights and training code. The tradeoff is that NaturalSpeech may have more extensive optimization for production deployment, while StyleTTS 2 gives you full control over the training pipeline.
When to choose StyleTTS 2: You need high-quality TTS that generates appropriate speaking styles from text alone, without requiring reference audio. You want an open-source model with MIT licensing that you can modify and deploy freely. You're working on single-speaker or multi-speaker applications where naturalness is the primary requirement.
When to consider alternatives: You need a model with explicit speaker cloning from short audio samples (StyleTTS 2 supports this but some dedicated speaker adaptation models may be more robust). You require a model with documented real-time streaming capabilities for low-latency applications. You need a model with a larger community and more pre-trained checkpoints available.