High-quality multilingual VITS-based TTS library from MyShell.ai supporting English, Spanish, French, Chinese, Japanese and Korean, fast enough for CPU real-time inference.
MeloTTS is a high-quality multilingual text-to-speech library developed by MyShell.ai in collaboration with MIT. It uses a VITS-based architecture—a fully end-to-end TTS approach that combines a variational autoencoder with a flow-based decoder and a transformer-based text encoder. The parameter count is undisclosed, but the architecture is dense, meaning all parameters are active during inference.
What sets MeloTTS apart is its ability to run real-time inference on CPU hardware while delivering natural-sounding speech across six languages. This makes it a practical choice for developers who need local TTS without GPU acceleration. The library supports English (with five accents: American, British, Indian, Australian, and Default), Spanish, French, Chinese (with mixed English support), Japanese, and Korean.
The MIT license means you can use MeloTTS in commercial projects, modify it, and distribute it without licensing fees. This is a significant advantage over many TTS models that carry restrictive licenses or require API access.
MeloTTS builds on the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) framework. The architecture uses a conditional variational autoencoder that jointly models text and audio in a latent space. A normalizing flow module maps between the prior distribution and the posterior, enabling high-fidelity waveform generation.
While the exact parameter count isn't disclosed, the model's efficiency is evident from its CPU real-time performance. The VITS architecture is inherently more efficient than two-stage TTS systems (which separate the acoustic model from the vocoder), since it generates waveforms directly from text in a single forward pass.
Context length is not specified, but in practice MeloTTS handles paragraph-length text without issues. Inference time scales roughly linearly with the length of the input text and the audio it produces, so longer inputs mean proportionally higher latency. For most TTS use cases—sentences to short paragraphs—this isn't a practical limitation.
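Because latency grows with input length, a common pattern is to split long text into sentence-sized chunks and synthesize them one at a time, playing each chunk as it finishes. A minimal, library-agnostic sketch (the splitter and the `max_chars` limit are illustrative, not part of MeloTTS):

```python
import re

def split_sentences(text: str, max_chars: int = 200) -> list[str]:
    """Naive sentence splitter: break on end punctuation, then pack
    sentences into chunks no longer than max_chars when possible."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for part in parts:
        if current and len(current) + 1 + len(part) > max_chars:
            chunks.append(current)
            current = part
        else:
            current = f"{current} {part}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to the TTS call in turn.
```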
The model's efficiency comes from its lightweight design and the VITS framework's ability to generate high-quality audio with fewer parameters than comparable TTS systems. This is why it can run on CPU hardware while maintaining natural prosody and voice quality.
MeloTTS excels at producing natural-sounding speech across multiple languages. The English models offer five distinct accents, each with appropriate pronunciation patterns. The American and British models are particularly strong, with natural intonation and rhythm.
The Chinese model supports mixed Chinese-English text, which is important for real-world applications where technical terms or names appear in English within Chinese sentences. This isn't a trivial feature—many TTS systems break or produce unnatural pauses when switching between languages.
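This matters for routing: mixed-script text should be sent to the Chinese model whole rather than split by language. A crude detection sketch (the function name and regexes are mine, not part of MeloTTS):

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")        # common CJK ideographs
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")    # a run of Latin letters

def has_mixed_zh_en(text: str) -> bool:
    """True when a string mixes CJK characters with Latin words."""
    return bool(CJK.search(text)) and bool(LATIN_WORD.search(text))

# e.g. a Chinese sentence containing the English term "GPU"
```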
Concrete use cases include offline voice assistants, screen readers and other accessibility tools, audiobook and article narration, and notification or IVR audio in any of the six supported languages.
The model doesn't support voice cloning or speaker adaptation out of the box, though the repository includes training scripts for custom datasets. If you need speaker-specific voices, you'll need to fine-tune on your own data.
MeloTTS is designed for local deployment. The GitHub repository provides a straightforward Python API, and you can install it via pip. Here's what you need to know about running it on your hardware.
Minimum hardware: Any modern CPU with AVX2 support (Intel Core i5-8xxx or AMD Ryzen 2xxx and newer). The model runs entirely on CPU—no GPU required for real-time inference. This makes it accessible on laptops, mini PCs, and even single-board computers.
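On Linux you can confirm AVX2 support by checking the flags line of `/proc/cpuinfo`. The helper below is a sketch for that one platform (other OSes expose CPU features differently):

```python
def has_avx2(cpuinfo: str) -> bool:
    """Scan a /proc/cpuinfo-style dump for the avx2 flag (Linux)."""
    for line in cpuinfo.splitlines():
        if line.lower().startswith("flags"):
            return "avx2" in line.lower().split()
    return False

# On Linux: has_avx2(open("/proc/cpuinfo").read())
```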
CPU real-time inference: The model generates audio faster than real-time on most modern CPUs. On an Intel Core i7-12700, expect 2-3x real-time speed for English. On a Raspberry Pi 4, you'll get around 0.5-0.8x real-time speed—usable for shorter texts but not ideal for long-form generation.
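"2-3x real time" refers to the real-time factor: audio duration divided by wall-clock generation time. A value above 1 means generation outpaces playback; below 1 means playback would stall.

```python
def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Audio duration / generation time; >1 is faster than real time."""
    return audio_seconds / wall_seconds

# 10 s of audio generated in 4 s of wall time:
# real_time_factor(10.0, 4.0)  # → 2.5
```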
RAM requirements: Approximately 1-2GB of system RAM for loading a single language model. If you load multiple language models simultaneously, multiply accordingly.
GPU acceleration: While not required, MeloTTS can use CUDA if available. On an RTX 4090 or RTX 3090, you'll get 5-10x real-time speed. The VRAM footprint is minimal—under 1GB for any single language model.
Quantization: The repository doesn't include pre-quantized models, but you can apply standard PyTorch quantization techniques. FP16 inference works well and reduces memory usage by roughly half. INT8 quantization is possible but may degrade audio quality noticeably.
Performance metrics: On CPU, expect 50-100 tokens per second for English text. On GPU, this increases to 500-1000+ tokens per second. Audio quality remains consistent across hardware—the only variable is generation speed.
Setup steps:
```
pip install MeloTTS
```

The repository also includes a community-contributed Web UI and CLI, making it accessible for non-programmers.
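Once installed, basic synthesis looks roughly like the sketch below. The `TTS` class, `hps.data.spk2id` speaker map, and `tts_to_file` method follow the project's README, but treat this as illustrative rather than authoritative; the import is deferred so the snippet loads even without the package installed.

```python
# Language codes used by MeloTTS (per the project README).
LANG_CODES = {"english": "EN", "spanish": "ES", "french": "FR",
              "chinese": "ZH", "japanese": "JP", "korean": "KR"}

def synthesize(text: str, language: str = "english",
               out_path: str = "out.wav", speed: float = 1.0) -> str:
    """Sketch: generate a WAV file with MeloTTS on CPU."""
    from melo.api import TTS  # deferred so this file imports without melo
    model = TTS(language=LANG_CODES[language], device="cpu")
    speakers = model.hps.data.spk2id       # e.g. 'EN-US', 'EN-BR', ...
    speaker_id = next(iter(speakers.values()))  # pick the first voice
    model.tts_to_file(text, speaker_id, out_path, speed=speed)
    return out_path

# Usage: synthesize("Hello from MeloTTS!", "english", "hello.wav")
```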
Compared to Coqui TTS, MeloTTS offers better multilingual support out of the box. Coqui TTS has more languages available through community models, but the quality varies significantly. MeloTTS's six languages are all production-quality, with consistent naturalness across them. Coqui TTS also requires more setup and dependency management.
Against Piper TTS, MeloTTS produces higher quality audio. Piper is optimized for speed and low resource usage on embedded devices, but its voice quality is noticeably more robotic. MeloTTS sounds more natural, especially for longer sentences and varied prosody. Piper wins on extreme edge cases (Raspberry Pi Zero-class hardware), but MeloTTS is the better choice for any device that can run it.
The tradeoff is model size and setup complexity. Piper's models are tiny (5-50MB) and trivial to deploy. MeloTTS models are larger (200-500MB per language) and require a Python environment. If you're deploying to thousands of embedded devices with strict memory constraints, Piper makes sense. For any scenario where audio quality matters and you have a capable CPU, MeloTTS is the superior option.