An open-source text-to-speech system created by "inverting" OpenAI Whisper, aiming to be commercially safe and fully hackable.
WhisperSpeech is an open-source text-to-speech system developed by Collabora, built on an unconventional approach: inverting OpenAI's Whisper speech recognition model. Rather than training a TTS system from scratch, the team reversed Whisper's encoder-decoder architecture to generate speech from text input. The model uses a dense architecture with an undisclosed parameter count and is released under the MIT license.
This matters because most commercial TTS systems are either closed-source or trained on unlicensed data, making them risky for production use. WhisperSpeech was trained exclusively on the English LibriLight dataset of properly licensed speech recordings, and all code is open source. The project's stated goal is to be to speech what Stable Diffusion is to images: powerful, hackable, and commercially safe.
WhisperSpeech currently supports English text-to-speech generation. A multilingual release is in development, with early experiments showing promising results across English, Polish, and French.
WhisperSpeech inverts Whisper's architecture by repurposing its learned representations. Where Whisper takes audio as input and produces text, WhisperSpeech takes text and produces audio tokens. The system uses a two-stage pipeline: a semantic token model that converts text into discrete speech representations, followed by an acoustic model that renders those tokens into waveform audio.
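A minimal sketch of that data flow, with illustrative names (the real library wraps both stages behind a single interface; `t2s`, `s2a`, and `vocoder` here are stand-ins, not public API):

```python
import torch

def synthesize(text: str, t2s, s2a, vocoder) -> torch.Tensor:
    """Sketch of WhisperSpeech's two-stage pipeline (names are illustrative)."""
    # Stage 1: text -> discrete semantic tokens (Whisper-derived representations)
    semantic_tokens = t2s.generate(text)
    # Stage 2: semantic tokens -> acoustic tokens suitable for a neural vocoder
    acoustic_tokens = s2a.generate(semantic_tokens)
    # Final decode: acoustic tokens -> waveform samples
    return vocoder.decode(acoustic_tokens)
```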
The model uses a dense architecture, meaning all parameters are active during inference. While Collabora has not disclosed the exact parameter count, the model is designed to run efficiently on consumer hardware. The GitHub repository reports 12× faster-than-real-time inference on an RTX 4090 after optimizations including torch.compile, KV-caching, and layer tuning.
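Those optimizations are standard PyTorch techniques rather than anything WhisperSpeech-specific. A minimal sketch of applying torch.compile to a stand-in transformer (not the actual WhisperSpeech model):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in transformer; WhisperSpeech's real decoder is loaded via its own pipeline.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=6).eval().to(device)

# torch.compile traces the module and emits fused kernels; the first call
# pays the compilation cost, later calls reuse the optimized graph.
compiled = torch.compile(model)

with torch.inference_mode():
    x = torch.randn(1, 128, 512, device=device)
    out = compiled(x)  # warm-up call triggers compilation
    out = compiled(x)  # subsequent calls use the cached graph
```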
The system supports voice cloning through semantic token manipulation. The team demonstrated that a tiny S2A (semantic-to-acoustic) model trained on English, Polish, and French data can clone voices using semantic tokens frozen from a model trained only on English and Polish—suggesting the semantic tokenizer may generalize across languages without retraining.
WhisperSpeech excels at generating natural-sounding English speech from text input, including cloning a voice from a short reference recording. It is not a general-purpose AI system: it generates speech only, and does not support transcription, translation, or multimodal tasks.
WhisperSpeech is designed for local deployment. The inference pipeline runs in PyTorch and supports standard optimization techniques.
The model runs on consumer GPUs. On an RTX 4090 with torch.compile and KV-caching enabled, WhisperSpeech generates audio at roughly 12× real time: producing 12 seconds of speech takes about 1 second of compute. Without these optimizations, expect lower throughput.
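The real-time factor is simple arithmetic; a quick helper makes the relationship explicit (plain Python, not part of the library):

```python
def generation_time(audio_seconds: float, rtf: float = 12.0) -> float:
    """Seconds of compute needed to synthesize `audio_seconds` of speech
    at a real-time factor of `rtf` (seconds of audio per second of compute)."""
    return audio_seconds / rtf

print(generation_time(12))   # -> 1.0 s, matching the claim above
print(generation_time(300))  # a 5-minute chapter -> 25.0 s of compute
```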
The quickest way to test WhisperSpeech is through the provided Colab notebook, which handles environment setup and dependency installation. For local deployment:
```
pip install -r requirements.txt
```

```python
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()
pipe.generate_to_file("output.wav", "Your text here")
```

Voice cloning requires a reference audio file. The repository includes example notebooks demonstrating both standard TTS and voice cloning workflows.
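A sketch of the cloning call, assuming the pipeline's generate methods accept a speaker reference (confirm the exact parameter name against the repository's cloning notebook):

```python
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()
# `speaker` is assumed to point at a short recording of the target voice;
# check the voice-cloning notebook for the exact signature.
pipe.generate_to_file("cloned.wav", "Your text here", speaker="reference.wav")
```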
WhisperSpeech occupies a unique position in the open-source TTS landscape. Its key differentiator is the inverted Whisper architecture and the commitment to commercially safe training data.
vs. Coqui TTS: Coqui offers more language support and a larger model ecosystem, but its training data licensing varies by model. WhisperSpeech's MIT license and fully licensed training data make it the safer choice for commercial deployment. Coqui may offer more natural output on some voices, but WhisperSpeech's voice cloning capabilities are more straightforward to implement.
vs. Bark (Suno): Bark produces more expressive speech with non-verbal sounds (laughter, sighs) but is significantly larger and slower to run locally. WhisperSpeech's 12× real-time performance on consumer hardware makes it more practical for production workloads. Bark also has unresolved licensing questions around its training data.
Choose WhisperSpeech when you need a TTS system you can deploy commercially without legal risk, and when inference speed matters more than maximum expressiveness. Choose alternatives if you need broader language support today or require non-verbal vocalizations in generated speech.