A lightweight 880M-parameter fully open-source text-to-speech model controllable via natural-language voice-description prompts.
No benchmark data available for this model yet.
Parler-TTS Mini v1 is a fully open-source text-to-speech model from Hugging Face that generates natural speech from text using natural-language voice descriptions. At 0.88B parameters, it occupies a specific niche: a lightweight TTS model that gives you fine-grained control over voice characteristics without requiring enterprise-grade hardware.
Unlike proprietary TTS APIs or models that lock you into predefined voices, Parler-TTS Mini v1 lets you describe exactly how the output should sound—gender, pitch, speaking rate, background noise level, and reverberation—all in plain English. The model was trained on 45,000 hours of narrated audio data and released under Apache 2.0, meaning you can use it, modify it, and deploy it without licensing restrictions.
What sets Parler-TTS apart from other open TTS models is its natural-language conditioning. Instead of selecting voice ID numbers or uploading reference audio clips, you write a prompt like "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up." The model interprets that description and generates matching speech. This is a reproduction of the work published in "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" by Dan Lyth and Simon King.
Parler-TTS Mini v1 uses a dense transformer architecture with 0.88B parameters. It is not a mixture-of-experts model—all parameters are active during inference. This means VRAM usage scales linearly with model size, but you also get consistent quality across all generations without routing tokens to different expert pathways.
The model is built on the Hugging Face transformers library and uses ParlerTTSForConditionalGeneration for inference. It processes two inputs: the text prompt (what to say) and the description prompt (how to say it). Both are tokenized separately and fed into the model for conditional generation.
Because this is a dense 0.88B model, memory requirements are modest. At full precision (FP32), the model occupies roughly 3.5 GB of VRAM. At FP16, that drops to approximately 1.8 GB. With 4-bit quantization, you can fit it in under 1 GB. The model supports SDPA (Scaled Dot-Product Attention) and Flash Attention 2, which significantly speed up generation on compatible GPUs. You can also compile the model with torch.compile for additional inference speed gains.
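The VRAM figures above follow directly from the parameter count, since every parameter is resident in memory. A back-of-envelope sketch (weights only; activations, the KV cache, and framework overhead add to these numbers):

```python
# Weight memory for a dense model: params x bits-per-param, nothing gated off.
PARAMS = 0.88e9  # 880M parameters, all active at inference

def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Weight storage at a given precision, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.2f} GB")
```

This reproduces the figures above: 3.52 GB at FP32, 1.76 GB at FP16, and 0.44 GB at 4-bit, before runtime overhead.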
The model outputs audio at the sampling rate defined in its configuration (exposed as model.config.sampling_rate), which varies by checkpoint. Output is generated as a raw audio array that you can save to WAV or other formats using libraries like soundfile.
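If you prefer not to add a dependency, the raw array can also be written as 16-bit PCM with the standard-library wave module. A sketch using a sine tone as a stand-in for real model output (the `save_wav` helper and 24 kHz rate here are illustrative, not part of the Parler-TTS API):

```python
import wave
import numpy as np

def save_wav(path: str, audio: np.ndarray, sampling_rate: int) -> None:
    """Write a float array in [-1, 1] as mono 16-bit PCM WAV (stdlib only)."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono output
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sampling_rate)
        wf.writeframes(pcm.tobytes())

# Stand-in for model output: 0.5 s of a 440 Hz tone at 24 kHz
sr = 24_000
t = np.linspace(0, 0.5, int(sr * 0.5), endpoint=False)
save_wav("tone.wav", 0.2 * np.sin(2 * np.pi * 440 * t), sr)
```

With real model output, pass model.config.sampling_rate instead of a hard-coded rate.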
Parler-TTS Mini v1 generates English speech with controllable voice characteristics through natural-language descriptions. The key capability is voice description conditioning—you control gender, pitch, speaking rate, expressiveness, background noise, proximity, and reverberation through text prompts.
Concrete use cases:
The model does not support speaker consistency natively—each generation is conditioned on the description prompt, not on a specific speaker embedding. If you need consistent voices across multiple generations, reuse the same description prompt each time, or fine-tune the model on specific speakers using the Parler-TTS training code. For speaker-consistent generation, consider the newer Parler-TTS Mini v1.1 or Large v1 checkpoints, which introduce speaker consistency features.
Parler-TTS Mini v1 runs on consumer hardware without issue. Here's what you need to know for local deployment.
Minimum hardware requirements:
Recommended hardware:
VRAM requirements by quantization:
- FP32: roughly 3.5 GB
- FP16: approximately 1.8 GB
- 4-bit: under 1 GB
Expected performance:
On an RTX 4090 at FP16, you can expect real-time or faster generation for short prompts (1-5 seconds of audio generated in under a second). On an RTX 3060 at 8-bit quantization, generation is near real-time. On CPU-only systems, expect generation to take 2-5x longer than the audio duration.
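These claims are easiest to compare as a real-time factor: generation time divided by audio duration, where values below 1.0 mean faster than playback. A quick sketch using illustrative timings consistent with the figures above (not measurements from this document):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means audio is generated faster than it plays back."""
    return generation_seconds / audio_seconds

# Hypothetical timings matching the scenarios described above
rtx_4090_fp16 = real_time_factor(0.9, 5.0)   # well under real time
cpu_only      = real_time_factor(12.0, 5.0)  # in the 2-5x range
print(rtx_4090_fp16, cpu_only)
```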
Quickest way to get started:
1. Install the package: pip install git+https://github.com/huggingface/parler-tts.git
2. Load the model with ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1")
3. Call model.generate() and save the output audio

For optimal performance on compatible NVIDIA GPUs, enable Flash Attention 2 by passing attn_implementation="flash_attention_2" when loading the model. On Apple Silicon, use the MPS device and compile the model with torch.compile for significant speedups.
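One way to wire up those load-time options is a small helper that picks device, dtype, and attention backend at startup. This is a sketch under the assumption that flash-attn is installed when CUDA is available (the helper name is ours, not part of the library):

```python
import torch

def pick_runtime() -> tuple[str, torch.dtype, str]:
    """Choose device, dtype, and attention implementation for loading.
    flash_attention_2 additionally requires the flash-attn package and a
    sufficiently recent NVIDIA GPU; sdpa is the safe default elsewhere."""
    if torch.cuda.is_available():
        return "cuda:0", torch.float16, "flash_attention_2"
    if torch.backends.mps.is_available():  # Apple Silicon
        return "mps", torch.float16, "sdpa"
    return "cpu", torch.float32, "sdpa"

device, dtype, attn = pick_runtime()
print(device, dtype, attn)
```

The tuple feeds straight into loading, e.g. from_pretrained(..., torch_dtype=dtype, attn_implementation=attn) followed by .to(device), and optionally torch.compile(model) afterward.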
Parler-TTS Mini v1 vs. Piper TTS: Piper is a smaller, faster TTS system optimized for edge devices and home automation. It uses VITS-based architectures and runs efficiently on Raspberry Pi-class hardware. Parler-TTS Mini v1 produces higher-quality, more natural speech and offers voice description conditioning that Piper lacks. Choose Piper if you need minimal latency and can accept robotic output. Choose Parler-TTS if audio quality and voice control matter more than raw speed.
Parler-TTS Mini v1 vs. Coqui TTS (YourTTS): YourTTS is a multilingual TTS model that supports voice cloning from short reference audio. Parler-TTS Mini v1 does not support voice cloning—it uses text descriptions instead. YourTTS requires reference audio for each voice, while Parler-TTS lets you generate new voices on the fly from descriptions. Parler-TTS also has a more permissive license (Apache 2.0 vs. Coqui's non-commercial restrictions on some models). Choose YourTTS if you need voice cloning or multilingual support. Choose Parler-TTS for English-only generation with fine-grained voice control and unrestricted licensing.
Parler-TTS Mini v1 vs. Parler-TTS Large v1: The Large variant has 2.3B parameters and produces higher-quality audio with better speaker consistency. It requires approximately 2.6x more VRAM and generates audio more slowly. Mini v1 is the pragmatic choice for local deployment on consumer hardware where you need reasonable quality without upgrading your GPU.