High-quality multilingual VITS-based TTS library from MyShell.ai supporting English, Spanish, French, Chinese, Japanese and Korean, fast enough for CPU real-time inference.
MeloTTS is a high-quality multilingual text-to-speech library developed by MyShell.ai in collaboration with MIT. It uses a VITS-based architecture—a fully end-to-end TTS approach that combines a variational autoencoder with a flow-based decoder and a transformer-based text encoder. The parameter count is undisclosed, but the architecture is dense, meaning all parameters are active during inference.
What sets MeloTTS apart is its ability to run real-time inference on CPU hardware while delivering natural-sounding speech across six languages. This makes it a practical choice for developers who need local TTS without GPU acceleration. The library supports English (with five accents: American, British, Indian, Australian, and Default), Spanish, French, Chinese (with mixed English support), Japanese, and Korean.
The MIT license means you can use MeloTTS in commercial projects, modify it, and distribute it without licensing fees. This is a significant advantage over many TTS models that carry restrictive licenses or require API access.
MeloTTS builds on the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) framework. The architecture uses a conditional variational autoencoder that jointly models text and audio in a latent space. A normalizing flow module maps between the prior distribution and the posterior, enabling high-fidelity waveform generation.
While the exact parameter count isn't disclosed, the model's efficiency is evident from its CPU real-time performance. The VITS architecture is inherently more efficient than two-stage TTS systems (which separate the acoustic model from the vocoder), since it generates waveforms directly from text in a single forward pass.
Context length is not specified, but in practice MeloTTS handles paragraph-length text without issues. Inference time scales roughly linearly with the length of the input text and the audio it produces, so longer inputs mean proportionally higher latency. For most TTS use cases—sentences to short paragraphs—this isn't a practical limitation.
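Because latency grows with input length, a common pattern is to split long text into sentence-sized chunks and synthesize them one at a time, playing each chunk as it finishes. A minimal, library-agnostic sketch (the splitter and the `max_chars` limit are illustrative, not part of MeloTTS):

```python
import re

def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Naive sentence splitter: break on end punctuation, then pack
    sentences into chunks no longer than max_chars when possible."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for part in parts:
        if current and len(current) + 1 + len(part) > max_chars:
            chunks.append(current)
            current = part
        else:
            current = f"{current} {part}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to the TTS call in turn.
```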
The model's efficiency comes from its lightweight design and the VITS framework's ability to generate high-quality audio with fewer parameters than comparable TTS systems. This is why it can run on CPU hardware while maintaining natural prosody and voice quality.
MeloTTS excels at producing natural-sounding speech across multiple languages. The English models offer five distinct accents, each with appropriate pronunciation patterns. The American and British models are particularly strong, with natural intonation and rhythm.
The Chinese model supports mixed Chinese-English text, which is important for real-world applications where technical terms or names appear in English within Chinese sentences. This isn't a trivial feature—many TTS systems break or produce unnatural pauses when switching between languages.
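This matters for routing: mixed-script text should be sent to the Chinese model whole rather than split by language. A crude detection sketch (the function name and regexes are mine, not part of MeloTTS):

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")        # common CJK ideographs
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")    # a run of Latin letters

def has_mixed_zh_en(text: str) -> bool:
    """True when a string mixes CJK characters with Latin words."""
    return bool(CJK.search(text)) and bool(LATIN_WORD.search(text))

# e.g. a Chinese sentence containing the English term "GPU"
```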
Concrete use cases include offline voice assistants, screen readers and other accessibility tools, audiobook and article narration, and notification or IVR audio in any of the six supported languages.
The model doesn't support voice cloning or speaker adaptation out of the box, though the repository includes training scripts for custom datasets. If you need speaker-specific voices, you'll need to fine-tune on your own data.
MeloTTS is designed for local deployment. The GitHub repository provides a straightforward Python API, and you can install it via pip. Here's what you need to know about running it on your hardware.
Minimum hardware: Any modern CPU with AVX2 support (Intel Core i5-8xxx or AMD Ryzen 2xxx and newer). The model runs entirely on CPU—no GPU required for real-time inference. This makes it accessible on laptops, mini PCs, and even single-board computers.
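On Linux you can confirm AVX2 support by checking the flags line of `/proc/cpuinfo`. The helper below is a sketch for that one platform (other OSes expose CPU features differently):

```python
def has_avx2(cpuinfo: str) -> bool:
    """Scan a /proc/cpuinfo-style dump for the avx2 flag (Linux)."""
    for line in cpuinfo.splitlines():
        if line.lower().startswith("flags"):
            return "avx2" in line.lower().split()
    return False

# On Linux: has_avx2(open("/proc/cpuinfo").read())
```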
CPU real-time inference: The model generates audio faster than real-time on most modern CPUs. On an Intel Core i7-12700, expect 2-3x real-time speed for English. On a Raspberry Pi 4, you'll get around 0.5-0.8x real-time speed—usable for shorter texts but not ideal for long-form generation.
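"2-3x real time" refers to the real-time factor: audio duration divided by wall-clock generation time. A value above 1 means generation outpaces playback; below 1 means playback would stall.

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Audio duration / generation time; >1 is faster than real time."""
    return audio_seconds / wall_seconds

# 10 s of audio generated in 4 s of wall time:
# real_time_factor(10.0, 4.0)  # → 2.5
```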
RAM requirements: Approximately 1-2GB of system RAM for loading a single language model. If you load multiple language models simultaneously, multiply accordingly.
GPU acceleration: While not required, MeloTTS can use CUDA if available. On an RTX 4090 or RTX 3090, you'll get 5-10x real-time speed. The VRAM footprint is minimal—under 1GB for any single language model.
Quantization: The repository doesn't include pre-quantized models, but you can apply standard PyTorch quantization techniques. FP16 inference works well and reduces memory usage by roughly half. INT8 quantization is possible but may degrade audio quality noticeably.
Performance metrics: On CPU, expect 50-100 tokens per second for English text. On GPU, this increases to 500-1000+ tokens per second. Audio quality remains consistent across hardware—the only variable is generation speed.
Setup steps:
```
pip install MeloTTS
```

The repository also includes a community-contributed Web UI and CLI, making it accessible for non-programmers.
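Once installed, basic synthesis looks roughly like the sketch below. The `TTS` class, `hps.data.spk2id` speaker map, and `tts_to_file` method follow the project's README, but treat this as illustrative rather than authoritative; the import is deferred so the snippet loads even without the package installed.

```python
# Language codes used by MeloTTS (per the project README).
LANG_CODES = {"english": "EN", "spanish": "ES", "french": "FR",
              "chinese": "ZH", "japanese": "JP", "korean": "KR"}

def synthesize(text: str, language: str = "english",
               out_path: str = "out.wav", speed: float = 1.0) -> str:
    """Sketch: generate a WAV file with MeloTTS on CPU."""
    from melo.api import TTS  # deferred so this file imports without melo
    model = TTS(language=LANG_CODES[language], device="cpu")
    speakers = model.hps.data.spk2id       # e.g. 'EN-US', 'EN-BR', ...
    speaker_id = next(iter(speakers.values()))  # pick the first voice
    model.tts_to_file(text, speaker_id, out_path, speed=speed)
    return out_path

# Usage: synthesize("Hello from MeloTTS!", "english", "hello.wav")
```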
Compared to Coqui TTS, MeloTTS offers better multilingual support out of the box. Coqui TTS has more languages available through community models, but the quality varies significantly. MeloTTS's six languages are all production-quality, with consistent naturalness across them. Coqui TTS also requires more setup and dependency management.
Against Piper TTS, MeloTTS produces higher quality audio. Piper is optimized for speed and low resource usage on embedded devices, but its voice quality is noticeably more robotic. MeloTTS sounds more natural, especially for longer sentences and varied prosody. Piper wins on extreme edge cases (Raspberry Pi Zero-class hardware), but MeloTTS is the better choice for any device that can run it.
The tradeoff is model size and setup complexity. Piper's models are tiny (5-50MB) and trivial to deploy. MeloTTS models are larger (200-500MB per language) and require a Python environment. If you're deploying to thousands of embedded devices with strict memory constraints, Piper makes sense. For any scenario where audio quality matters and you have a capable CPU, MeloTTS is the superior option.