Human-level multi-speaker TTS that uses style diffusion and adversarial training with a large pre-trained speech language model.
StyleTTS 2 is a text-to-speech model developed by Yinghao Aaron Li and colleagues at Columbia University, released under the MIT license. It represents a significant advance in local TTS synthesis, achieving what the authors describe as human-level performance on both single-speaker and multi-speaker benchmarks. The model uses a dense architecture with an undisclosed parameter count, and its sole modality is text-to-speech generation.
What sets StyleTTS 2 apart from earlier TTS models is its approach to style generation. Unlike models that require a reference audio sample to clone a speaking style, StyleTTS 2 infers the appropriate prosody, intonation, and rhythm directly from the input text. This makes it practical for applications where you don't have a reference speaker to emulate. The model uses style diffusion—a latent diffusion process that generates a style embedding conditioned on the text—combined with adversarial training using large pre-trained speech language models (SLMs) like WavLM as discriminators.
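To make the mechanism concrete, here is a toy sketch of text-conditioned style sampling using a plain DDPM-style reverse process in PyTorch. The network, dimensions, and noise schedule below are illustrative assumptions, not the paper's formulation or the repository's code; the point is only that a fixed-size style vector is drawn from noise, guided by text features.

```python
import torch
import torch.nn as nn

class StyleDenoiser(nn.Module):
    """Toy denoiser: predicts the noise in a style vector given text features and a timestep.

    Stand-in for the conditional network used by style diffusion; dimensions are illustrative.
    """
    def __init__(self, style_dim=128, text_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(style_dim + text_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, style_dim),
        )

    def forward(self, noisy_style, text_emb, t):
        # t is a scalar timestep in [0, 1], broadcast across the batch
        t_feat = t.expand(noisy_style.shape[0], 1)
        return self.net(torch.cat([noisy_style, text_emb, t_feat], dim=-1))

@torch.no_grad()
def sample_style(denoiser, text_emb, steps=50, style_dim=128):
    """DDPM-like ancestral sampling: start from noise, iteratively denoise into a style vector."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    s = torch.randn(text_emb.shape[0], style_dim)  # pure noise
    for i in reversed(range(steps)):
        t = torch.tensor([[i / steps]])
        eps = denoiser(s, text_emb, t)  # predicted noise
        # standard DDPM posterior mean update
        s = (s - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            s = s + torch.sqrt(betas[i]) * torch.randn_like(s)
    return s  # style embedding that conditions the rest of the pipeline

# Usage: text_emb would come from the text encoder in a real system.
denoiser = StyleDenoiser()
style = sample_style(denoiser, text_emb=torch.randn(1, 512))
print(style.shape)  # torch.Size([1, 128])
```

In the real model, this sampled vector conditions the duration and prosody predictors and the decoder, which is why the output prosody varies naturally across runs without any reference audio.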
The model was published at NeurIPS 2023 and has been validated against human recordings. On the LJSpeech single-speaker dataset, StyleTTS 2 surpassed ground-truth recordings in naturalness as judged by native English speakers. On the VCTK multi-speaker dataset, it matched human recordings. These results place it in a category with systems like NaturalSpeech and VALL-E, but with the advantage of being publicly available with an open-source codebase.
StyleTTS 2's exact parameter count isn't published, but the model's memory footprint and inference speed suggest it operates in a range that is feasible on consumer hardware, unlike some large-scale TTS systems that require server-grade GPUs.
The architecture consists of several key components:
Text encoder: converts the input phoneme sequence into hidden representations.
Style diffusion module: samples a fixed-size style vector from noise, conditioned on the text, capturing prosody and speaking style.
Duration and prosody predictors: predict phoneme durations, pitch, and energy, conditioned on the sampled style vector.
Speech decoder: generates the waveform directly from the aligned phoneme features, predicted pitch and energy, and the style vector; the released code includes HiFi-GAN-based and iSTFTNet-based decoder variants.
SLM discriminator: a large pre-trained speech language model (WavLM) used adversarially during training to push outputs toward human-level naturalness; it is not needed at inference time.
The context length is not specified, but the model has been demonstrated on long-form narration tasks, suggesting it can handle paragraph-length inputs without degradation. The text-only modality means you feed it plain text and get audio output—no multi-modal inputs required.
StyleTTS 2 excels at generating natural-sounding speech from text across multiple scenarios:
Single-speaker synthesis: On the LJSpeech benchmark, the model produces speech that native English speakers rated as more natural than actual human recordings. This makes it suitable for audiobook narration, voice assistants, and content creation where consistent voice quality matters.
Multi-speaker synthesis: The VCTK results show the model can handle multiple speakers with naturalness matching human recordings. This is useful for dialogue systems, character voices in games, or any application requiring distinct speaker identities.
Zero-shot speaker adaptation: When trained on LibriTTS, StyleTTS 2 outperforms previous publicly available models for adapting to new speakers without fine-tuning. You can provide a short sample of a target speaker and generate speech in that voice from arbitrary text.
Long-form narration: The model handles extended text inputs, making it practical for generating podcasts, lecture recordings, or automated voiceovers for video content. In practice, long scripts are typically split into sentences and synthesized chunk by chunk, as sketched after this list.
Speech expressiveness: Because the style diffusion module generates appropriate prosody from text alone, the output includes natural variations in pitch, rhythm, and emphasis—avoiding the monotone quality of older TTS systems.
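As a sketch of the long-form workflow referenced above, the example below splits a script into sentences, synthesizes each one, and stitches the audio together with short pauses. The `synthesize` function is a hypothetical placeholder for whatever inference entry point you assemble from the StyleTTS 2 repository, and the 24 kHz sample rate is an assumption about the checkpoint's output rate.

```python
import re
import numpy as np
import soundfile as sf

SAMPLE_RATE = 24_000  # assumed output rate of the released checkpoints

def synthesize(sentence: str) -> np.ndarray:
    """Hypothetical stand-in for a StyleTTS 2 inference call returning a float32 waveform."""
    raise NotImplementedError("wire this up to the repository's inference code")

def narrate(script: str, out_path: str, pause_s: float = 0.35) -> None:
    # Split on sentence-ending punctuation; good enough for narration scripts.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", script) if s.strip()]
    pause = np.zeros(int(pause_s * SAMPLE_RATE), dtype=np.float32)
    chunks = []
    for sentence in sentences:
        chunks.append(synthesize(sentence))
        chunks.append(pause)  # short silence between sentences
    sf.write(out_path, np.concatenate(chunks), SAMPLE_RATE)

# narrate(open("chapter1.txt").read(), "chapter1.wav")
```

Chunking keeps each synthesis call short and lets prosody reset naturally at sentence boundaries, which is usually preferable to feeding an entire chapter in one pass.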
StyleTTS 2 runs on consumer hardware, though requirements depend on whether you're running inference only or fine-tuning.
Minimum hardware for inference: A GPU with 4GB VRAM can run the model at reduced precision. This covers most modern GPUs including the RTX 3060, RTX 4060, and equivalent AMD cards. On an RTX 4090, you can expect real-time or faster synthesis—generating several seconds of audio per second of processing time.
Recommended hardware: For comfortable inference with headroom, 8GB VRAM is sufficient. An RTX 3070, RTX 4070, or M4 Max with adequate unified memory will handle the model without issues. The model does not require the memory bandwidth of large language models, so even mid-range GPUs perform well.
Quantization: The model benefits from FP16 inference, which halves memory usage compared to FP32 with negligible quality loss. INT8 quantization is possible but may introduce audible artifacts in the generated speech. For most users, FP16 is the sweet spot.
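Here is the mixed-precision pattern in PyTorch, shown on a dummy module that stands in for the StyleTTS 2 components (the repository has its own model-loading code); the same autocast wrapper applies once the real modules are on the GPU.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Dummy network standing in for the loaded StyleTTS 2 modules; the pattern is what matters.
net = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).to(device).eval()
features = torch.randn(1, 200, 512, device=device)  # e.g. phoneme-level features

with torch.inference_mode():
    # Autocast runs matmuls and convolutions in FP16 while keeping numerically
    # sensitive ops in FP32, roughly halving activation memory on CUDA.
    with torch.autocast(device_type="cuda", dtype=torch.float16, enabled=(device == "cuda")):
        out = net(features)

print(out.dtype)  # torch.float16 on CUDA, torch.float32 when autocast is disabled
```

You can also call `.half()` on the whole model, but autocast is the safer default if any submodule is sensitive to half precision.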
Performance: Tokens per second is not the relevant metric here, since StyleTTS 2 generates audio waveforms directly; what matters is how many seconds of audio you get per second of wall-clock time. On an RTX 4090, you can expect to generate 5-10 seconds of audio per second of wall time for single-speaker synthesis. Multi-speaker and zero-shot modes are slightly slower but remain real-time capable.
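To check synthesis speed on your own hardware, a small timing helper like the one below works. `synth_fn` is any callable that maps text to a 1-D waveform (for example, an inference function built from the repository's demo code), and the 24 kHz sample rate is again an assumption.

```python
import time
import numpy as np

SAMPLE_RATE = 24_000  # assumed output rate of the released checkpoints

def measure_throughput(synth_fn, text: str, warmup: int = 1, runs: int = 5) -> float:
    """Return seconds of audio generated per second of wall time (>1 means faster than real time)."""
    for _ in range(warmup):
        synth_fn(text)  # warm up CUDA kernels and caches
    start = time.perf_counter()
    total_audio_s = 0.0
    for _ in range(runs):
        wav = synth_fn(text)  # expected: 1-D float waveform (NumPy array or tensor)
        total_audio_s += len(wav) / SAMPLE_RATE
    wall_s = time.perf_counter() - start
    return total_audio_s / wall_s

# Example, with a hypothetical inference function you have wired up:
# speedup = measure_throughput(my_styletts2_inference, "A sentence of typical narration length.")
# print(f"{speedup:.1f}x real time")
```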
Getting started: The official GitHub repository provides training and inference scripts. For quick local deployment, you can clone the repo, install the dependencies listed in requirements.txt, and run the provided demo scripts. The model weights are available through the repository, and there are Colab notebooks for testing without local setup.
VRAM requirements breakdown (approximate, based on the figures above):
Inference at FP16: around 4GB, enough for single-speaker and multi-speaker synthesis on cards like the RTX 3060 or RTX 4060.
Inference at FP32: roughly double the FP16 footprint; 8GB provides comfortable headroom.
Fine-tuning: noticeably higher and dependent on batch size and segment length; plan for more than the inference minimum.
StyleTTS 2 vs. VITS: VITS is a popular end-to-end TTS model that also uses adversarial training and variational inference. StyleTTS 2 improves on VITS in two key areas: style diversity (through the diffusion-based style module) and naturalness (through the SLM discriminator). In head-to-head comparisons on the LJSpeech dataset, StyleTTS 2 outperforms VITS by a significant margin in naturalness ratings. However, VITS has a larger ecosystem of pre-trained models and community support, making it easier to get started for some use cases.
StyleTTS 2 vs. NaturalSpeech: NaturalSpeech (from Microsoft) achieves comparable quality but is not publicly available with an open-source license. StyleTTS 2 offers similar or better performance with full access to the model weights and training code. The tradeoff is that NaturalSpeech may have more extensive optimization for production deployment, while StyleTTS 2 gives you full control over the training pipeline.
When to choose StyleTTS 2: You need high-quality TTS that generates appropriate speaking styles from text alone, without requiring reference audio. You want an open-source model with MIT licensing that you can modify and deploy freely. You're working on single-speaker or multi-speaker applications where naturalness is the primary requirement.
When to consider alternatives: You need a model with explicit speaker cloning from short audio samples (StyleTTS 2 supports this but some dedicated speaker adaptation models may be more robust). You require a model with documented real-time streaming capabilities for low-latency applications. You need a model with a larger community and more pre-trained checkpoints available.