Classic 1–5 listener rating for TTS naturalness. The longest-running quality metric in speech.
MOS (Mean Opinion Score) is the oldest TTS quality metric still in use. Listeners hear a clip and rate it on a scale from 1 (bad) to 5 (excellent); the MOS is the average of those ratings. It is an absolute quality reading (4.5 is "very good", 3 is "fair"), which makes it complementary to preference-based rankings such as TTS Arena.
Each clip is rated by 20–100 listeners. Ratings are averaged and reported with confidence intervals. The metric is most useful when many systems are scored under the same protocol; raw MOS numbers from different papers are not always directly comparable, because listener pools, test clips, and rating instructions differ.
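As a rough illustration of the averaging step, the sketch below computes a MOS and a 95% confidence interval for one clip. The function name `mos_with_ci`, the sample ratings, and the normal-approximation interval (z = 1.96) are assumptions made for this example; the protocol used for the scores on this page is not specified here, and published evaluations may use a t-distribution or bootstrap interval instead.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 listener ratings.

    Uses a normal-approximation interval: mean +/- z * standard error.
    This is an illustrative choice, not the protocol behind any
    specific published score.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Sample standard deviation divided by sqrt(n) gives the standard error.
    stderr = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * stderr


# Hypothetical ratings from 20 listeners for a single clip.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 3, 5, 4, 4, 5, 4, 4, 3, 5]
mos, half_width = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f} (95% CI, n={len(ratings)})")
```

With only 20 listeners the interval is wide relative to the gaps between systems in a typical ranking, which is one reason small differences in MOS should be read with care.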
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | CosyVoice 2.0 | Alibaba | Open | 5.0 / 5 |
| 02 | Eleven Multilingual v2 | ElevenLabs | Closed | 4.5 / 5 |
| 03 | Eleven v3 | ElevenLabs | Closed | 4.5 / 5 |
| 04 | XTTS v2 | Coqui | Open | 4.2 / 5 |
| 05 | StyleTTS 2 | Columbia University | Open | 4.2 / 5 |
| 06 | WhisperSpeech | Collabora | Open | 3.9 / 5 |
Natural human speech tends to score 4.3–4.6. Top neural TTS systems in 2026 land in the same range on read speech. Below 4.0, most listeners can clearly tell that a clip is synthetic.