A 2.2B-parameter fully open-source text-to-speech model controllable via natural-language descriptions of voice and acoustic characteristics.
Parler-TTS Large v1 is a 2.2B-parameter text-to-speech model from Hugging Face, released under the Apache 2.0 license. It's a dense transformer that generates speech from text input, with a critical differentiator: you control the voice characteristics through natural language prompts, not through speaker embeddings or reference audio clips.
This model is part of the Parler-TTS project, which reproduces the approach from the "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" paper by Stability AI and the University of Edinburgh. Unlike many TTS models that require proprietary datasets or closed-source training pipelines, Parler-TTS is fully open — all datasets, preprocessing code, training scripts, and weights are public.
At 2.2B parameters, the model is large for open TTS but still well below text models like Meta's Llama 2 7B in size. That parameter count is meaningful for local deployment: it's large enough to produce high-quality, natural speech, but small enough to run on consumer hardware with proper quantization.
Parler-TTS Large v1 uses a dense transformer architecture with 2.2B parameters. It's built on the encoder-decoder framework from the original paper, where a text encoder processes the input prompt and a decoder generates audio tokens.
The model operates in two stages internally: the transformer decoder first generates discrete audio tokens, which a neural audio codec then decodes into a high-fidelity waveform. The transformer stage is what enables natural-language control of voice characteristics — during training, the model learns to map descriptive text (e.g., "a female speaker with a moderate pitch and clear enunciation") to acoustic features.
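The two-stage data flow can be sketched with stubbed components. This is an illustrative sketch only — the codebook count, frame rate, and vocabulary size below are placeholder values, not the model's actual configuration:

```python
import random

CODEBOOKS = 4    # illustrative; neural codecs emit several parallel token streams
FRAME_RATE = 75  # illustrative codec frames per second
VOCAB = 1024     # illustrative codebook size

def encode_text(description: str) -> list[int]:
    # Stand-in for the text encoder: map characters to integer ids.
    return [ord(c) % 256 for c in description]

def decode_audio_tokens(text_ids: list[int], seconds: float) -> list[list[int]]:
    # Stand-in for the autoregressive decoder: emit one row of codebook
    # tokens per codec frame, conditioned (here, only nominally) on the text.
    rng = random.Random(sum(text_ids))
    n_frames = int(seconds * FRAME_RATE)
    return [[rng.randrange(VOCAB) for _ in range(CODEBOOKS)] for _ in range(n_frames)]

def codec_decode(frames: list[list[int]]) -> list[float]:
    # Stand-in for the neural audio codec that turns tokens into a waveform.
    samples_per_frame = 588  # illustrative: 44100 Hz / 75 frames per second
    return [0.0] * (len(frames) * samples_per_frame)

text_ids = encode_text("a female speaker with a moderate pitch")
frames = decode_audio_tokens(text_ids, seconds=2.0)   # 150 frames of 4 tokens
waveform = codec_decode(frames)                        # 88200 silent samples
```

The point of the sketch is the shape of the pipeline: text in, a grid of discrete audio tokens in the middle, raw samples out.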
Context length is not specified by the provider, but for practical TTS use, the relevant limitation is audio duration rather than text length. The model handles standard paragraph-length inputs without issue.
Parler-TTS Large v1 excels at controllable speech generation: you describe the voice you want in plain English, and the model produces speech matching that description.
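Because the control channel is plain English, descriptions can be assembled programmatically. A minimal sketch — the helper and its attribute vocabulary are illustrative, not a fixed API of the library:

```python
def build_description(gender: str, pitch: str, pace: str, quality: str) -> str:
    """Compose a natural-language voice description from structured attributes.
    Hypothetical helper for illustration; the model simply consumes the string."""
    return (
        f"A {gender} speaker with a {pitch} pitch delivers the words "
        f"at a {pace} speed in a {quality} recording."
    )

desc = build_description("female", "moderate", "slightly fast", "very clear")
# "A female speaker with a moderate pitch delivers the words at a
#  slightly fast speed in a very clear recording."
```

This is what makes text-based control convenient for programmatic use: voice characteristics become ordinary parameters rather than reference audio files.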
The model is English-only. If you need multilingual TTS, look at models like Coqui TTS or XTTS.
This is where Parler-TTS Large v1 becomes practical for developers. The 2.2B parameter count is manageable on consumer GPUs with the right setup.
For most users, 8-bit quantization offers the best balance of quality and VRAM efficiency. The voice characteristic control remains accurate, and you save ~40% VRAM compared to FP16. Use 4-bit only if you're constrained to 4GB or less VRAM and quality isn't critical.
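The rough weight-memory arithmetic behind those recommendations (weights only — activations and generation buffers add overhead, which is why the end-to-end saving is closer to ~40% than the 50% the weight sizes alone suggest):

```python
PARAMS = 2.2e9  # Parler-TTS Large v1 parameter count

def weight_gb(bits_per_param: float) -> float:
    """Approximate VRAM occupied by the weights alone, in gigabytes."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)  # ≈ 4.4 GB
int8 = weight_gb(8)   # ≈ 2.2 GB
int4 = weight_gb(4)   # ≈ 1.1 GB
```

At 4-bit the weights fit comfortably under 4 GB, which is why that mode is the fallback for heavily constrained cards.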
Performance scales with GPU compute: an RTX 3060 at 8-bit will be roughly 2-3x slower than an RTX 4090 at FP16.
The fastest path to local inference:
```shell
pip install git+https://github.com/huggingface/parler-tts.git
```

Ollama does not currently support Parler-TTS. You'll need to use the native Python library or the Hugging Face Transformers integration.
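A minimal generation script, following the usage pattern from the Parler-TTS README (assumes the `parler-tts/parler-tts-large-v1` checkpoint ID, a CUDA GPU for reasonable speed, and a first-run download of the FP16 weights):

```python
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-large-v1"
).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-large-v1")

prompt = "The quick brown fox jumps over the lazy dog."
description = (
    "A female speaker with a moderate pitch and clear enunciation "
    "delivers her words at a moderate speed."
)

# The description conditions the voice; the prompt is the text to be spoken.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_out.wav", audio, model.config.sampling_rate)
```

Note the two separate tokenized inputs: swapping only the `description` string changes the voice while the spoken text stays the same.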
vs. Coqui TTS (1.2B parameters): Coqui is smaller and faster, but lacks natural-language voice control. You need reference audio to clone a voice with Coqui. Parler-TTS gives you text-based control, which is more flexible for programmatic use. Choose Coqui if you need lower latency and have reference audio; choose Parler-TTS if you need on-the-fly voice customization.
vs. Bark (1.2B parameters): Bark produces more expressive speech with emotional range, but it's significantly slower and less controllable for specific voice characteristics. Bark also has higher VRAM requirements relative to its parameter count due to inefficient architecture. Choose Bark for creative applications where expressiveness matters more than consistency; choose Parler-TTS for production pipelines where you need reliable, controllable output.
vs. XTTS v2 (1.6B parameters): XTTS is multilingual and supports voice cloning from short audio samples. Parler-TTS is English-only but offers finer-grained control through text descriptions. XTTS requires a GPU with at least 6GB VRAM; Parler-TTS can run on less with quantization. Choose XTTS for multilingual needs; choose Parler-TTS for English applications where you want to script voice characteristics.