PolyAI's efficient, compact conversational TTS framework, designed for fast, parallel speech generation with roughly 10× less training data than comparable systems.
Pheme is a compact, text-only conversational text-to-speech (TTS) framework from PolyAI, designed for efficient, parallel speech generation. At 0.3B parameters, it occupies a unique niche: a dense Transformer-based model that achieves natural conversational speech output while requiring roughly 10× less training data than comparable systems like VALL-E or SoundStorm.
PolyAI, known for enterprise-grade conversational AI deployed across healthcare, hospitality, and logistics, built Pheme to address a specific gap in the TTS landscape. Most state-of-the-art speech generation models are autoregressive — they produce tokens one at a time, which introduces latency that makes real-time conversational use impractical. Pheme breaks from that pattern by using a MaskGit-style inference approach that generates speech tokens in parallel, delivering up to 15× speed improvements over similarly sized autoregressive models.
This isn't a general-purpose language model. It's a specialized speech generation framework that prioritizes three things: parameter efficiency, data efficiency, and inference speed. For practitioners building conversational agents, voice assistants, or real-time speech applications that need to run locally, Pheme is worth serious evaluation.
Pheme uses a dense Transformer architecture with 0.3B parameters. That's small enough to run on consumer hardware without quantization, but the real architectural innovation is in how it handles speech tokenization and generation.
The framework separates semantic and acoustic tokens — a design choice that reduces the complexity of what the model needs to learn. Instead of trying to model raw audio directly, Pheme works with discrete speech tokens produced by a separate SpeechTokenizer. This separation allows the model to focus on generating natural-sounding conversational patterns rather than spending capacity on low-level acoustic details.
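In practice this separation maps to a two-stage pipeline, reflected in the repository's train_t2s.py and train_s2a.py scripts: a text-to-semantic (T2S) model followed by a semantic-to-acoustic (S2A) model, with the acoustic tokens decoded back to a waveform. The sketch below is illustrative only; the callables stand in for the repository's actual components, whose interfaces may differ.

```python
import torch
from typing import Callable

def synthesize(
    text_ids: torch.Tensor,
    t2s: Callable[[torch.Tensor], torch.Tensor],     # text -> semantic tokens
    s2a: Callable[[torch.Tensor], torch.Tensor],     # semantic -> acoustic tokens
    decode: Callable[[torch.Tensor], torch.Tensor],  # acoustic tokens -> waveform
) -> torch.Tensor:
    """Illustrative two-stage generation path (not the repository's real API)."""
    semantic = t2s(text_ids)    # stage 1: content and prosody-level structure
    acoustic = s2a(semantic)    # stage 2: speaker timbre and fine acoustic detail
    return decode(acoustic)     # SpeechTokenizer-style decoder back to audio
```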
The parallel inference mechanism is the key differentiator. Traditional autoregressive TTS models generate one token at a time, with each step depending on the previous one. Pheme uses MaskGit-style parallel decoding, which predicts multiple tokens simultaneously and refines them through iterative masking. This yields the 15× speedup over autoregressive approaches at comparable model sizes.
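For intuition, here is a minimal sketch of MaskGit-style decoding. It illustrates the general technique rather than Pheme's actual implementation; `model` is a hypothetical callable that returns per-position logits over the speech-token vocabulary.

```python
import math
import torch

def maskgit_decode(model, length: int, mask_id: int, steps: int = 8) -> torch.Tensor:
    """Generic MaskGit-style parallel decoding loop (illustration, not Pheme's code).

    Start with every position masked. Each step, predict all masked positions at
    once, keep the most confident predictions, and re-mask the rest so they can
    be refined in later iterations.
    """
    tokens = torch.full((1, length), mask_id, dtype=torch.long)
    finalized = torch.zeros(1, length, dtype=torch.bool)

    for step in range(steps):
        logits = model(tokens)                                # (1, length, vocab)
        confidence, prediction = logits.softmax(-1).max(-1)   # best token per position
        # Already-finalized positions should never be re-masked.
        confidence = confidence.masked_fill(finalized, float("inf"))

        # Cosine schedule: the fraction of positions left masked shrinks to zero.
        num_masked = int(length * math.cos(math.pi / 2 * (step + 1) / steps))

        # Accept predictions everywhere that isn't finalized yet...
        tokens = torch.where(finalized, tokens, prediction)
        # ...then re-mask the least confident positions for the next iteration.
        if num_masked > 0:
            remask = confidence.topk(num_masked, largest=False).indices
            tokens[0, remask[0]] = mask_id
        finalized = tokens.ne(mask_id)

    return tokens
```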
Training efficiency is equally notable. The model can be trained effectively on conversational, podcast, and noisy data (the paper references GigaSpeech as a viable training source), and it achieves strong results with roughly one-tenth the data required by VALL-E or SoundStorm. For practitioners who want to fine-tune or adapt the model for specific voices or domains, this lower data requirement is a practical advantage.
The framework also supports student-teacher training with synthetic data from third-party providers to improve single-speaker quality. The codebase and pretrained models are available under the CC-BY-4.0 license, and the official repository provides training recipes for both the text-to-semantic (T2S) and semantic-to-acoustic (S2A) stages.
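Conceptually, the student-teacher recipe comes down to fine-tuning Pheme on token targets derived from teacher-synthesized audio. The sketch below is a heavily simplified illustration under that assumption, with hypothetical names; the real recipe lives in the repository's training scripts.

```python
import torch
import torch.nn.functional as F

def student_step(student: torch.nn.Module,
                 batch: dict,
                 optimizer: torch.optim.Optimizer) -> float:
    """One illustrative fine-tuning step on teacher-generated data.

    `batch["text_ids"]` holds the input text and `batch["target_tokens"]` the
    discrete speech tokens obtained by tokenizing synthetic audio produced by a
    third-party (teacher) TTS for the single target speaker.
    """
    logits = student(batch["text_ids"])              # (batch, time, vocab)
    loss = F.cross_entropy(
        logits.flatten(0, 1),                        # (batch * time, vocab)
        batch["target_tokens"].flatten(),            # (batch * time,)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```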
Pheme is a conversational TTS framework. It is not designed for general language understanding, code generation, or multimodal tasks. Its strengths are in producing natural, human-like speech from text input, optimized for real-time conversational contexts.
Concrete use cases include conversational voice agents, customer-facing voice assistants in domains like healthcare, hospitality, and logistics, and real-time speech applications that need to run locally with low latency.
The model is text-only in terms of input modality — it takes text and produces speech tokens that are then converted to audio via the SpeechTokenizer decoder. There is no support for image, audio, or video inputs.
Pheme's 0.3B parameter count makes it one of the most accessible TTS models for local deployment. Here's what you need to know.
Because this is a dense model with relatively few parameters, VRAM requirements are modest: at 0.3B parameters, the weights occupy roughly 0.6 GB in FP16, so the model runs comfortably on virtually any modern consumer GPU.
For most users, FP16 is the practical default. The model is small enough that quantization isn't necessary for VRAM reasons on any dedicated GPU. Use FP16 unless you're targeting a device with less than 4 GB VRAM.
If you need to run on constrained hardware, INT8 offers a good balance of quality and memory savings. INT4 is available for edge cases but may introduce noticeable quality degradation in speech output — the tradeoff is less forgiving than with language models.
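Since the released code is a PyTorch codebase, the standard PyTorch precision controls are the natural way to apply this advice. The snippet below is a generic sketch in which `model` stands in for a loaded Pheme module; it is not a documented Pheme API.

```python
import torch

def to_fp16(model: torch.nn.Module) -> torch.nn.Module:
    """FP16 on GPU: the practical default for a 0.3B dense model."""
    return model.half().to("cuda")

def to_dynamic_int8(model: torch.nn.Module) -> torch.nn.Module:
    """Dynamic INT8 quantization of the Linear layers (CPU inference).

    Worth trying on constrained hardware, but expect some audible loss in
    speech quality, which is harder to mask than the equivalent loss in an LLM.
    """
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```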
Parallel inference means you can expect fast generation: the MaskGit-style decoder emits speech tokens in parallel rather than one at a time, which is where the up-to-15× speedup over similarly sized autoregressive models comes from.
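To put a number on "fast" for your own hardware, the simplest check is the real-time factor (generation time divided by audio duration). The sketch below assumes a hypothetical end-to-end `generate` callable that returns a waveform tensor at a known sample rate.

```python
import time
import torch

def real_time_factor(generate, text_ids, sample_rate: int = 16_000, runs: int = 5) -> float:
    """Rough RTF measurement; values below 1.0 mean faster-than-playback synthesis."""
    waveform = generate(text_ids)          # warm-up so caches/kernels don't skew timing
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(runs):
        waveform = generate(text_ids)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / runs

    audio_seconds = waveform.numel() / sample_rate  # set sample_rate to the decoder's output rate
    return elapsed / audio_seconds
```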
The fastest path to running Pheme locally is through the official GitHub repository at PolyAI-LDN/pheme. Set up a conda environment with Python 3.10, install PyTorch and the requirements, and download the pretrained SpeechTokenizer checkpoint and token list. From there, the train_t2s.py and train_s2a.py scripts handle training of the text-to-semantic and semantic-to-acoustic stages, and generation is run through the repository's demo code.
There is no Ollama support for Pheme as of early 2025 — this is a specialized TTS model, not a general-purpose LLM, so you'll need to work directly with the Python codebase. The repository includes a demo directory and sample audio outputs to verify your setup.
Pheme competes in the compact TTS space, where the primary alternatives are autoregressive models like VALL-E (300M-1.5B parameters) and SoundStorm (based on the SoundStream codec with similar parameter counts).
Pheme vs VALL-E: VALL-E requires more training data and operates autoregressively, meaning higher inference latency. VALL-E can produce more diverse outputs in some scenarios, but Pheme's parallel inference gives it a decisive advantage for real-time applications. If latency matters, choose Pheme. If you have abundant training data and can tolerate slower generation, VALL-E remains a strong option.
Pheme vs SoundStorm: SoundStorm uses a similar parallel decoding approach but requires the SoundStream neural audio codec and more training data. Pheme's separation of semantic and acoustic tokens, combined with its lower data requirements, makes it more practical for teams that don't have massive speech datasets. SoundStorm may produce marginally higher audio quality with sufficient data, but Pheme wins on efficiency and ease of training.
When to choose Pheme: You need real-time conversational speech generation on consumer hardware, you want to train or fine-tune with limited data, or you're building voice agents that require low latency. It's the pragmatic choice for production conversational AI running locally.