An open-source text-to-speech system created by "inverting" OpenAI Whisper, aiming to be commercially safe and fully hackable.
WhisperSpeech is an open-source text-to-speech system developed by Collabora, built on an unconventional approach: inverting OpenAI's Whisper speech recognition model. Rather than training a TTS system from scratch, the team reversed Whisper's encoder-decoder architecture to generate speech from text input. The model uses a dense architecture with an undisclosed parameter count and is released under the MIT license.
This matters because most commercial TTS systems are either closed-source or trained on unlicensed data, making them risky for production use. WhisperSpeech was trained exclusively on the English LibriLight dataset of properly licensed speech recordings, and all code is open source. The project's stated goal is to be to speech what Stable Diffusion is to images: powerful, hackable, and commercially safe.
WhisperSpeech currently supports English text-to-speech generation. A multilingual release is in development, with early experiments showing promising results across English, Polish, and French.
WhisperSpeech inverts Whisper's architecture by repurposing its learned representations. Where Whisper takes audio as input and produces text, WhisperSpeech takes text and produces audio tokens. The system uses a two-stage pipeline: a semantic token model that converts text into discrete speech representations, followed by an acoustic model that renders those tokens into waveform audio.
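A minimal sketch of that data flow, with illustrative names (the real library wraps both stages behind a single interface; `t2s`, `s2a`, and `vocoder` here are stand-ins, not public API):

```python
import torch

def synthesize(text: str, t2s, s2a, vocoder) -> torch.Tensor:
    """Sketch of WhisperSpeech's two-stage pipeline (names are illustrative)."""
    # Stage 1: text -> discrete semantic tokens (Whisper-derived representations)
    semantic_tokens = t2s.generate(text)
    # Stage 2: semantic tokens -> acoustic tokens suitable for a neural vocoder
    acoustic_tokens = s2a.generate(semantic_tokens)
    # Final decode: acoustic tokens -> waveform samples
    return vocoder.decode(acoustic_tokens)
```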
The model uses a dense architecture, meaning all parameters are active during inference. While Collabora has not disclosed the exact parameter count, the model is designed to run efficiently on consumer hardware. The GitHub repository reports 12× faster-than-real-time inference on an RTX 4090 after optimizations including torch.compile, KV-caching, and layer tuning.
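Those optimizations are standard PyTorch techniques rather than anything WhisperSpeech-specific. A minimal sketch of applying torch.compile to a stand-in transformer (not the actual WhisperSpeech model):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in transformer; WhisperSpeech's real decoder is loaded via its own pipeline.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=6).eval().to(device)

# torch.compile traces the module and emits fused kernels; the first call
# pays the compilation cost, later calls reuse the optimized graph.
compiled = torch.compile(model)

with torch.inference_mode():
    x = torch.randn(1, 128, 512, device=device)
    out = compiled(x)  # warm-up call triggers compilation
    out = compiled(x)  # subsequent calls use the cached graph
```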
The system supports voice cloning through semantic token manipulation. The team demonstrated that a tiny S2A (semantic-to-acoustic) model trained on English, Polish, and French data can clone voices using semantic tokens frozen from a model trained only on English and Polish—suggesting the semantic tokenizer may generalize across languages without retraining.
WhisperSpeech excels at generating natural-sounding English speech from text input, including cloning a voice from a short reference recording. It is not a general-purpose AI system: it generates speech only, and does not support transcription, translation, or multimodal tasks.
WhisperSpeech is designed for local deployment. The inference pipeline runs in PyTorch and supports standard optimization techniques.
The model runs on consumer GPUs. On an RTX 4090 with torch.compile and KV-caching enabled, WhisperSpeech generates audio at roughly 12× real time: producing 12 seconds of speech takes about 1 second of compute. Without these optimizations, expect lower throughput.
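The real-time factor is simple arithmetic; a quick helper makes the relationship explicit (plain Python, not part of the library):

```python
def generation_time(audio_seconds: float, rtf: float = 12.0) -> float:
    """Seconds of compute needed to synthesize `audio_seconds` of speech
    at a real-time factor of `rtf` (seconds of audio per second of compute)."""
    return audio_seconds / rtf

print(generation_time(12))   # -> 1.0 s, matching the claim above
print(generation_time(300))  # a 5-minute chapter -> 25.0 s of compute
```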
The quickest way to test WhisperSpeech is through the provided Colab notebook, which handles environment setup and dependency installation. For local deployment:
```
pip install -r requirements.txt
```

```python
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()
pipe.generate_to_file("output.wav", "Your text here")
```

Voice cloning requires a reference audio file. The repository includes example notebooks demonstrating both standard TTS and voice cloning workflows.
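A sketch of the cloning call, assuming the pipeline's generate methods accept a speaker reference (confirm the exact parameter name against the repository's cloning notebook):

```python
from whisperspeech.pipeline import Pipeline

pipe = Pipeline()
# `speaker` is assumed to point at a short recording of the target voice;
# check the voice-cloning notebook for the exact signature.
pipe.generate_to_file("cloned.wav", "Your text here", speaker="reference.wav")
```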
WhisperSpeech occupies a unique position in the open-source TTS landscape. Its key differentiator is the inverted Whisper architecture and the commitment to commercially safe training data.
vs. Coqui TTS: Coqui offers more language support and a larger model ecosystem, but its training data licensing varies by model. WhisperSpeech's MIT license and fully licensed training data make it the safer choice for commercial deployment. Coqui may offer more natural output on some voices, but WhisperSpeech's voice cloning capabilities are more straightforward to implement.
vs. Bark (Suno): Bark produces more expressive speech with non-verbal sounds (laughter, sighs) but is significantly larger and slower to run locally. WhisperSpeech's 12× real-time performance on consumer hardware makes it more practical for production workloads. Bark also has unresolved licensing questions around its training data.
Choose WhisperSpeech when you need a TTS system you can deploy commercially without legal risk, and when inference speed matters more than maximum expressiveness. Choose alternatives if you need broader language support today or require non-verbal vocalizations in generated speech.