A 2.2B-parameter fully open-source text-to-speech model controllable via natural-language descriptions of voice and acoustic characteristics.
Parler-TTS Large v1 is a 2.2B-parameter text-to-speech model from Hugging Face, released under the Apache 2.0 license. It's a dense transformer that generates speech from text input, with a critical differentiator: you control the voice characteristics through natural language prompts, not through speaker embeddings or reference audio clips.
This model is part of the Parler-TTS project, which reproduces the approach from the "Natural language guidance of high-fidelity text-to-speech with synthetic annotations" paper by Stability AI and the University of Edinburgh. Unlike many TTS models that require proprietary datasets or closed-source training pipelines, Parler-TTS is fully open — all datasets, preprocessing code, training scripts, and weights are public.
At 2.2B parameters, the model is large for open TTS but still well below text models like Meta's Llama 2 7B in size. That parameter count is meaningful for local deployment: it's large enough to produce high-quality, natural speech, but small enough to run on consumer hardware with proper quantization.
Parler-TTS Large v1 uses a dense transformer architecture with 2.2B parameters. It's built on the encoder-decoder framework from the original paper, where a text encoder processes the input prompt and a decoder generates audio tokens.
The model operates in two stages internally: the transformer decoder first generates discrete audio tokens, which a neural audio codec then decodes into a high-fidelity waveform. The transformer stage is what enables natural-language control of voice characteristics — during training, the model learns to map descriptive text (e.g., "a female speaker with a moderate pitch and clear enunciation") to acoustic features.
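The two-stage data flow can be sketched with stubbed components. This is an illustrative sketch only — the codebook count, frame rate, and vocabulary size below are placeholder values, not the model's actual configuration:

```python
import random

CODEBOOKS = 4    # illustrative; neural codecs emit several parallel token streams
FRAME_RATE = 75  # illustrative codec frames per second
VOCAB = 1024     # illustrative codebook size

def encode_text(description: str) -> list[int]:
    # Stand-in for the text encoder: map characters to integer ids.
    return [ord(c) % 256 for c in description]

def decode_audio_tokens(text_ids: list[int], seconds: float) -> list[list[int]]:
    # Stand-in for the autoregressive decoder: emit one row of codebook
    # tokens per codec frame, conditioned (here, only nominally) on the text.
    rng = random.Random(sum(text_ids))
    n_frames = int(seconds * FRAME_RATE)
    return [[rng.randrange(VOCAB) for _ in range(CODEBOOKS)] for _ in range(n_frames)]

def codec_decode(frames: list[list[int]]) -> list[float]:
    # Stand-in for the neural audio codec that turns tokens into a waveform.
    samples_per_frame = 588  # illustrative: 44100 Hz / 75 frames per second
    return [0.0] * (len(frames) * samples_per_frame)

text_ids = encode_text("a female speaker with a moderate pitch")
frames = decode_audio_tokens(text_ids, seconds=2.0)   # 150 frames of 4 tokens
waveform = codec_decode(frames)                        # 88200 silent samples
```

The point of the sketch is the shape of the pipeline: text in, a grid of discrete audio tokens in the middle, raw samples out.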
Context length is not specified by the provider, but for practical TTS use, the relevant limitation is audio duration rather than text length. The model handles standard paragraph-length inputs without issue.
Parler-TTS Large v1 excels at controllable speech generation: you describe the voice you want in plain English, and the model produces speech matching that description.
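Because the control channel is plain English, descriptions can be assembled programmatically. A minimal sketch — the helper and its attribute vocabulary are illustrative, not a fixed API of the library:

```python
def build_description(gender: str, pitch: str, pace: str, quality: str) -> str:
    """Compose a natural-language voice description from structured attributes.
    Hypothetical helper for illustration; the model simply consumes the string."""
    return (
        f"A {gender} speaker with a {pitch} pitch delivers the words "
        f"at a {pace} speed in a {quality} recording."
    )

desc = build_description("female", "moderate", "slightly fast", "very clear")
# "A female speaker with a moderate pitch delivers the words at a
#  slightly fast speed in a very clear recording."
```

This is what makes text-based control convenient for programmatic use: voice characteristics become ordinary parameters rather than reference audio files.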
The model is English-only. If you need multilingual TTS, look at models like Coqui TTS or XTTS.
This is where Parler-TTS Large v1 becomes practical for developers. The 2.2B parameter count is manageable on consumer GPUs with the right setup.
For most users, 8-bit quantization offers the best balance of quality and VRAM efficiency. The voice characteristic control remains accurate, and you save ~40% VRAM compared to FP16. Use 4-bit only if you're constrained to 4GB or less VRAM and quality isn't critical.
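The rough weight-memory arithmetic behind those recommendations (weights only — activations and generation buffers add overhead, which is why the end-to-end saving is closer to ~40% than the 50% the weight sizes alone suggest):

```python
PARAMS = 2.2e9  # Parler-TTS Large v1 parameter count

def weight_gb(bits_per_param: float) -> float:
    """Approximate VRAM occupied by the weights alone, in gigabytes."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)  # ≈ 4.4 GB
int8 = weight_gb(8)   # ≈ 2.2 GB
int4 = weight_gb(4)   # ≈ 1.1 GB
```

At 4-bit the weights fit comfortably under 4 GB, which is why that mode is the fallback for heavily constrained cards.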
Performance scales with GPU compute: an RTX 3060 at 8-bit will be roughly 2-3x slower than an RTX 4090 at FP16.
The fastest path to local inference:
```shell
pip install git+https://github.com/huggingface/parler-tts.git
```

Ollama does not currently support Parler-TTS. You'll need to use the native Python library or the Hugging Face Transformers integration.
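A minimal generation script, following the usage pattern from the Parler-TTS README (assumes the `parler-tts/parler-tts-large-v1` checkpoint ID, a CUDA GPU for reasonable speed, and a first-run download of the FP16 weights):

```python
import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

model = ParlerTTSForConditionalGeneration.from_pretrained(
    "parler-tts/parler-tts-large-v1"
).to(device)
tokenizer = AutoTokenizer.from_pretrained("parler-tts/parler-tts-large-v1")

prompt = "The quick brown fox jumps over the lazy dog."
description = (
    "A female speaker with a moderate pitch and clear enunciation "
    "delivers her words at a moderate speed."
)

# The description conditions the voice; the prompt is the text to be spoken.
input_ids = tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio = generation.cpu().numpy().squeeze()
sf.write("parler_out.wav", audio, model.config.sampling_rate)
```

Note the two separate tokenized inputs: swapping only the `description` string changes the voice while the spoken text stays the same.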
vs. Coqui TTS (1.2B parameters): Coqui is smaller and faster, but lacks natural-language voice control. You need reference audio to clone a voice with Coqui. Parler-TTS gives you text-based control, which is more flexible for programmatic use. Choose Coqui if you need lower latency and have reference audio; choose Parler-TTS if you need on-the-fly voice customization.
vs. Bark (1.2B parameters): Bark produces more expressive speech with emotional range, but it's significantly slower and less controllable for specific voice characteristics. Bark also has higher VRAM requirements relative to its parameter count due to inefficient architecture. Choose Bark for creative applications where expressiveness matters more than consistency; choose Parler-TTS for production pipelines where you need reliable, controllable output.
vs. XTTS v2 (1.6B parameters): XTTS is multilingual and supports voice cloning from short audio samples. Parler-TTS is English-only but offers finer-grained control through text descriptions. XTTS requires a GPU with at least 6GB VRAM; Parler-TTS can run on less with quantization. Choose XTTS for multilingual needs; choose Parler-TTS for English applications where you want to script voice characteristics.