PolyAI's efficient, compact conversational TTS framework, designed for fast, parallel speech generation with roughly 10× less training data than comparable systems.
Pheme is a compact, text-only conversational text-to-speech (TTS) framework from PolyAI, designed for efficient, parallel speech generation. At 0.3B parameters, it occupies a unique niche: a dense Transformer-based model that achieves natural conversational speech output while requiring roughly 10× less training data than comparable systems like VALL-E or SoundStorm.
PolyAI, known for enterprise-grade conversational AI deployed across healthcare, hospitality, and logistics, built Pheme to address a specific gap in the TTS landscape. Most state-of-the-art speech generation models are autoregressive — they produce tokens one at a time, which introduces latency that makes real-time conversational use impractical. Pheme breaks from that pattern by using a MaskGit-style inference approach that generates speech tokens in parallel, delivering up to 15× speed improvements over similarly sized autoregressive models.
This isn't a general-purpose language model. It's a specialized speech generation framework that prioritizes three things: parameter efficiency, data efficiency, and inference speed. For practitioners building conversational agents, voice assistants, or real-time speech applications that need to run locally, Pheme is worth serious evaluation.
Pheme uses a dense Transformer architecture with 0.3B parameters. That's small enough to run on consumer hardware without quantization, but the real architectural innovation is in how it handles speech tokenization and generation.
The framework separates semantic and acoustic tokens — a design choice that reduces the complexity of what the model needs to learn. Instead of trying to model raw audio directly, Pheme works with discrete speech tokens produced by a separate SpeechTokenizer. This separation allows the model to focus on generating natural-sounding conversational patterns rather than spending capacity on low-level acoustic details.
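In practice this separation maps to a two-stage pipeline, reflected in the repository's train_t2s.py and train_s2a.py scripts: a text-to-semantic (T2S) model followed by a semantic-to-acoustic (S2A) model, with the acoustic tokens decoded back to a waveform. The sketch below is illustrative only; the callables stand in for the repository's actual components, whose interfaces may differ.

```python
import torch
from typing import Callable

def synthesize(
    text_ids: torch.Tensor,
    t2s: Callable[[torch.Tensor], torch.Tensor],     # text -> semantic tokens
    s2a: Callable[[torch.Tensor], torch.Tensor],     # semantic -> acoustic tokens
    decode: Callable[[torch.Tensor], torch.Tensor],  # acoustic tokens -> waveform
) -> torch.Tensor:
    """Illustrative two-stage generation path (not the repository's real API)."""
    semantic = t2s(text_ids)    # stage 1: content and prosody-level structure
    acoustic = s2a(semantic)    # stage 2: speaker timbre and fine acoustic detail
    return decode(acoustic)     # SpeechTokenizer-style decoder back to audio
```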
The parallel inference mechanism is the key differentiator. Traditional autoregressive TTS models generate one token at a time, with each step depending on the previous one. Pheme uses MaskGit-style parallel decoding, which predicts multiple tokens simultaneously and refines them through iterative masking. This yields the 15× speedup over autoregressive approaches at comparable model sizes.
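For intuition, here is a minimal sketch of MaskGit-style decoding. It illustrates the general technique rather than Pheme's actual implementation; `model` is a hypothetical callable that returns per-position logits over the speech-token vocabulary.

```python
import math
import torch

def maskgit_decode(model, length: int, mask_id: int, steps: int = 8) -> torch.Tensor:
    """Generic MaskGit-style parallel decoding loop (illustration, not Pheme's code).

    Start with every position masked. Each step, predict all masked positions at
    once, keep the most confident predictions, and re-mask the rest so they can
    be refined in later iterations.
    """
    tokens = torch.full((1, length), mask_id, dtype=torch.long)
    finalized = torch.zeros(1, length, dtype=torch.bool)

    for step in range(steps):
        logits = model(tokens)                                # (1, length, vocab)
        confidence, prediction = logits.softmax(-1).max(-1)   # best token per position
        # Already-finalized positions should never be re-masked.
        confidence = confidence.masked_fill(finalized, float("inf"))

        # Cosine schedule: the fraction of positions left masked shrinks to zero.
        num_masked = int(length * math.cos(math.pi / 2 * (step + 1) / steps))

        # Accept predictions everywhere that isn't finalized yet...
        tokens = torch.where(finalized, tokens, prediction)
        # ...then re-mask the least confident positions for the next iteration.
        if num_masked > 0:
            remask = confidence.topk(num_masked, largest=False).indices
            tokens[0, remask[0]] = mask_id
        finalized = tokens.ne(mask_id)

    return tokens
```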
Training efficiency is equally notable. The model can be trained effectively on conversational, podcast, and noisy data (the paper references GigaSpeech as a viable training source), and it achieves strong results with roughly one-tenth the data required by VALL-E or SoundStorm. For practitioners who want to fine-tune or adapt the model for specific voices or domains, this lower data requirement is a practical advantage.
The framework also supports student-teacher training with synthetic data from third-party providers to improve single-speaker quality. The codebase and pretrained models are available under the CC-BY-4.0 license, and the official repository provides training recipes for both the text-to-semantic (T2S) and semantic-to-acoustic (S2A) stages.
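Conceptually, the student-teacher recipe comes down to fine-tuning Pheme on token targets derived from teacher-synthesized audio. The sketch below is a heavily simplified illustration under that assumption, with hypothetical names; the real recipe lives in the repository's training scripts.

```python
import torch
import torch.nn.functional as F

def student_step(student: torch.nn.Module,
                 batch: dict,
                 optimizer: torch.optim.Optimizer) -> float:
    """One illustrative fine-tuning step on teacher-generated data.

    `batch["text_ids"]` holds the input text and `batch["target_tokens"]` the
    discrete speech tokens obtained by tokenizing synthetic audio produced by a
    third-party (teacher) TTS for the single target speaker.
    """
    logits = student(batch["text_ids"])              # (batch, time, vocab)
    loss = F.cross_entropy(
        logits.flatten(0, 1),                        # (batch * time, vocab)
        batch["target_tokens"].flatten(),            # (batch * time,)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```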
Pheme is a conversational TTS framework. It is not designed for general language understanding, code generation, or multimodal tasks. Its strengths are in producing natural, human-like speech from text input, optimized for real-time conversational contexts.
Concrete use cases include conversational voice agents, customer-facing voice assistants in domains like healthcare, hospitality, and logistics, and real-time speech applications that need to run locally with low latency.
The model is text-only in terms of input modality — it takes text and produces speech tokens that are then converted to audio via the SpeechTokenizer decoder. There is no support for image, audio, or video inputs.
Pheme's 0.3B parameter count makes it one of the most accessible TTS models for local deployment. Here's what you need to know.
Because this is a dense model with relatively few parameters, VRAM requirements are modest: at 0.3B parameters, the weights occupy roughly 0.6 GB in FP16, so the model runs comfortably on virtually any modern consumer GPU.
For most users, FP16 is the practical default. The model is small enough that quantization isn't necessary for VRAM reasons on any dedicated GPU. Use FP16 unless you're targeting a device with less than 4 GB VRAM.
If you need to run on constrained hardware, INT8 offers a good balance of quality and memory savings. INT4 is available for edge cases but may introduce noticeable quality degradation in speech output — the tradeoff is less forgiving than with language models.
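Since the released code is a PyTorch codebase, the standard PyTorch precision controls are the natural way to apply this advice. The snippet below is a generic sketch in which `model` stands in for a loaded Pheme module; it is not a documented Pheme API.

```python
import torch

def to_fp16(model: torch.nn.Module) -> torch.nn.Module:
    """FP16 on GPU: the practical default for a 0.3B dense model."""
    return model.half().to("cuda")

def to_dynamic_int8(model: torch.nn.Module) -> torch.nn.Module:
    """Dynamic INT8 quantization of the Linear layers (CPU inference).

    Worth trying on constrained hardware, but expect some audible loss in
    speech quality, which is harder to mask than the equivalent loss in an LLM.
    """
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8
    )
```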
Parallel inference means you can expect fast generation: the MaskGit-style decoder emits speech tokens in parallel rather than one at a time, which is where the up-to-15× speedup over similarly sized autoregressive models comes from.
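To put a number on "fast" for your own hardware, the simplest check is the real-time factor (generation time divided by audio duration). The sketch below assumes a hypothetical end-to-end `generate` callable that returns a waveform tensor at a known sample rate.

```python
import time
import torch

def real_time_factor(generate, text_ids, sample_rate: int = 16_000, runs: int = 5) -> float:
    """Rough RTF measurement; values below 1.0 mean faster-than-playback synthesis."""
    waveform = generate(text_ids)          # warm-up so caches/kernels don't skew timing
    if torch.cuda.is_available():
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(runs):
        waveform = generate(text_ids)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = (time.perf_counter() - start) / runs

    audio_seconds = waveform.numel() / sample_rate  # set sample_rate to the decoder's output rate
    return elapsed / audio_seconds
```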
The fastest path to running Pheme locally is through the official GitHub repository at PolyAI-LDN/pheme. Set up a conda environment with Python 3.10, install PyTorch and the requirements, and download the pretrained SpeechTokenizer checkpoint and token list. From there, the train_t2s.py and train_s2a.py scripts handle training of the text-to-semantic and semantic-to-acoustic stages, and generation is run through the repository's demo code.
There is no Ollama support for Pheme as of early 2025 — this is a specialized TTS model, not a general-purpose LLM, so you'll need to work directly with the Python codebase. The repository includes a demo directory and sample audio outputs to verify your setup.
Pheme competes in the compact TTS space, where the primary alternatives are autoregressive models like VALL-E (300M-1.5B parameters) and SoundStorm (based on the SoundStream codec with similar parameter counts).
Pheme vs VALL-E: VALL-E requires more training data and operates autoregressively, meaning higher inference latency. VALL-E can produce more diverse outputs in some scenarios, but Pheme's parallel inference gives it a decisive advantage for real-time applications. If latency matters, choose Pheme. If you have abundant training data and can tolerate slower generation, VALL-E remains a strong option.
Pheme vs SoundStorm: SoundStorm uses a similar parallel decoding approach but requires the SoundStream neural audio codec and more training data. Pheme's separation of semantic and acoustic tokens, combined with its lower data requirements, makes it more practical for teams that don't have massive speech datasets. SoundStorm may produce marginally higher audio quality with sufficient data, but Pheme wins on efficiency and ease of training.
When to choose Pheme: You need real-time conversational speech generation on consumer hardware, you want to train or fine-tune with limited data, or you're building voice agents that require low latency. It's the pragmatic choice for production conversational AI running locally.