A lightweight 880M-parameter fully open-source text-to-speech model controllable via natural-language voice-description prompts.
No benchmark data available for this model yet.
Parler-TTS Mini v1 is a fully open-source text-to-speech model from Hugging Face that generates natural speech from text using natural-language voice descriptions. At 0.88B parameters, it occupies a specific niche: a lightweight TTS model that gives you fine-grained control over voice characteristics without requiring enterprise-grade hardware.
Unlike proprietary TTS APIs or models that lock you into predefined voices, Parler-TTS Mini v1 lets you describe exactly how the output should sound—gender, pitch, speaking rate, background noise level, and reverberation—all in plain English. The model was trained on 45,000 hours of narrated audio data and released under Apache 2.0, meaning you can use it, modify it, and deploy it without licensing restrictions.
What sets Parler-TTS apart from other open TTS models is its natural-language conditioning. Instead of selecting voice ID numbers or uploading reference audio clips, you write a prompt like "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up." The model interprets that description and generates matching speech. This is a reproduction of the work published in "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" by Dan Lyth and Simon King.
Parler-TTS Mini v1 uses a dense transformer architecture with 0.88B parameters. It is not a mixture-of-experts model—all parameters are active during inference. This means VRAM usage scales linearly with model size, but you also get consistent quality across all generations without routing tokens to different expert pathways.
The model is built on the Hugging Face transformers library and uses ParlerTTSForConditionalGeneration for inference. It processes two inputs: the text prompt (what to say) and the description prompt (how to say it). Both are tokenized separately and fed into the model for conditional generation.
Because this is a dense 0.88B model, memory requirements are modest. At full precision (FP32), the model occupies roughly 3.5 GB of VRAM. At FP16, that drops to approximately 1.8 GB. With 4-bit quantization, you can fit it in under 1 GB. The model supports SDPA (Scaled Dot-Product Attention) and Flash Attention 2, which significantly speed up generation on compatible GPUs. You can also compile the model with torch.compile for additional inference speed gains.
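The VRAM figures above follow directly from the parameter count, since every parameter is resident in memory. A back-of-envelope sketch (weights only; activations, the KV cache, and framework overhead add to these numbers):

```python
# Weight memory for a dense model: params x bits-per-param, nothing gated off.
PARAMS = 0.88e9  # 880M parameters, all active at inference

def weight_memory_gb(params: float, bits_per_param: int) -> float:
    """Weight storage at a given precision, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_param / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
    print(f"{name}: {weight_memory_gb(PARAMS, bits):.2f} GB")
```

This reproduces the figures above: 3.52 GB at FP32, 1.76 GB at FP16, and 0.44 GB at 4-bit, before runtime overhead.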
The model outputs audio at the sampling rate defined in its configuration (exposed as model.config.sampling_rate), which varies by checkpoint. Output is generated as a raw audio array that you can save to WAV or other formats using libraries like soundfile.
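If you prefer not to add a dependency, the raw array can also be written as 16-bit PCM with the standard-library wave module. A sketch using a sine tone as a stand-in for real model output (the `save_wav` helper and 24 kHz rate here are illustrative, not part of the Parler-TTS API):

```python
import wave
import numpy as np

def save_wav(path: str, audio: np.ndarray, sampling_rate: int) -> None:
    """Write a float array in [-1, 1] as mono 16-bit PCM WAV (stdlib only)."""
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)            # mono output
        wf.setsampwidth(2)            # 16-bit samples
        wf.setframerate(sampling_rate)
        wf.writeframes(pcm.tobytes())

# Stand-in for model output: 0.5 s of a 440 Hz tone at 24 kHz
sr = 24_000
t = np.linspace(0, 0.5, int(sr * 0.5), endpoint=False)
save_wav("tone.wav", 0.2 * np.sin(2 * np.pi * 440 * t), sr)
```

With real model output, pass model.config.sampling_rate instead of a hard-coded rate.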
Parler-TTS Mini v1 generates English speech with controllable voice characteristics through natural-language descriptions. The key capability is voice description conditioning—you control gender, pitch, speaking rate, expressiveness, background noise, proximity, and reverberation through text prompts.
Concrete use cases:
The model does not support speaker consistency natively—each generation is conditioned on the description prompt, not on a specific speaker embedding. If you need consistent voices across multiple generations, reuse the same description prompt each time, or fine-tune the model on specific speakers using the Parler-TTS training code. For speaker-consistent generation, consider the newer Parler-TTS Mini v1.1 or Large v1 checkpoints, which introduce speaker consistency features.
Parler-TTS Mini v1 runs on consumer hardware without issue. Here's what you need to know for local deployment.
Minimum hardware requirements:
Recommended hardware:
VRAM requirements by quantization:
- FP32: roughly 3.5 GB
- FP16: approximately 1.8 GB
- 4-bit: under 1 GB
Expected performance:
On an RTX 4090 at FP16, you can expect real-time or faster generation for short prompts (1-5 seconds of audio generated in under a second). On an RTX 3060 at 8-bit quantization, generation is near real-time. On CPU-only systems, expect generation to take 2-5x longer than the audio duration.
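These claims are easiest to compare as a real-time factor: generation time divided by audio duration, where values below 1.0 mean faster than playback. A quick sketch using illustrative timings consistent with the figures above (not measurements from this document):

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means audio is generated faster than it plays back."""
    return generation_seconds / audio_seconds

# Hypothetical timings matching the scenarios described above
rtx_4090_fp16 = real_time_factor(0.9, 5.0)   # well under real time
cpu_only      = real_time_factor(12.0, 5.0)  # in the 2-5x range
print(rtx_4090_fp16, cpu_only)
```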
Quickest way to get started:
1. Install the package: pip install git+https://github.com/huggingface/parler-tts.git
2. Load the model with ParlerTTSForConditionalGeneration.from_pretrained("parler-tts/parler-tts-mini-v1")
3. Call model.generate() and save the output audio

For optimal performance on compatible NVIDIA GPUs, enable Flash Attention 2 by passing attn_implementation="flash_attention_2" when loading the model. On Apple Silicon, use the MPS device and compile the model with torch.compile for significant speedups.
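One way to wire up those load-time options is a small helper that picks device, dtype, and attention backend at startup. This is a sketch under the assumption that flash-attn is installed when CUDA is available (the helper name is ours, not part of the library):

```python
import torch

def pick_runtime() -> tuple[str, torch.dtype, str]:
    """Choose device, dtype, and attention implementation for loading.
    flash_attention_2 additionally requires the flash-attn package and a
    sufficiently recent NVIDIA GPU; sdpa is the safe default elsewhere."""
    if torch.cuda.is_available():
        return "cuda:0", torch.float16, "flash_attention_2"
    if torch.backends.mps.is_available():  # Apple Silicon
        return "mps", torch.float16, "sdpa"
    return "cpu", torch.float32, "sdpa"

device, dtype, attn = pick_runtime()
print(device, dtype, attn)
```

The tuple feeds straight into loading, e.g. from_pretrained(..., torch_dtype=dtype, attn_implementation=attn) followed by .to(device), and optionally torch.compile(model) afterward.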
Parler-TTS Mini v1 vs. Piper TTS: Piper is a smaller, faster TTS system optimized for edge devices and home automation. It uses VITS-based architectures and runs efficiently on Raspberry Pi-class hardware. Parler-TTS Mini v1 produces higher-quality, more natural speech and offers voice description conditioning that Piper lacks. Choose Piper if you need minimal latency and can accept robotic output. Choose Parler-TTS if audio quality and voice control matter more than raw speed.
Parler-TTS Mini v1 vs. Coqui TTS (YourTTS): YourTTS is a multilingual TTS model that supports voice cloning from short reference audio. Parler-TTS Mini v1 does not support voice cloning—it uses text descriptions instead. YourTTS requires reference audio for each voice, while Parler-TTS lets you generate new voices on the fly from descriptions. Parler-TTS also has a more permissive license (Apache 2.0 vs. Coqui's non-commercial restrictions on some models). Choose YourTTS if you need voice cloning or multilingual support. Choose Parler-TTS for English-only generation with fine-grained voice control and unrestricted licensing.
Parler-TTS Mini v1 vs. Parler-TTS Large v1: The Large variant has 2.3B parameters and produces higher-quality audio with better speaker consistency. It requires approximately 2.6x more VRAM and generates audio more slowly. Mini v1 is the pragmatic choice for local deployment on consumer hardware where you need reasonable quality without upgrading your GPU.