Blind head-to-head listening test for text-to-speech models, ranked by Bradley-Terry ratings fitted to pairwise wins.
TTS Arena plays two anonymous audio clips for the same text and asks the listener which sounds better. Naturalness, prosody, voice identity stability, and emotion all feed into the preference. The aggregate Bradley-Terry rating is the single best proxy for "which TTS model feels human" today.
Listeners do not see which model produced which clip. Wins and losses feed a Bradley-Terry rating per model. We normalize the published rating to a 0–100 scale so it sits alongside WER and MOS on this page.
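To illustrate how pairwise votes could become a 0–100 score, here is a minimal sketch in Python. It fits Bradley-Terry strengths with the standard minorization-maximization update, then min-max scales the log-strengths. The scaling rule is an assumption: this page does not state its normalization formula, although the 100.0 and 0.0 endpoints in the table are consistent with min-max scaling. The model names `A`, `B`, `C`, the vote counts, and the function names `fit_bradley_terry` and `to_0_100` are hypothetical and not taken from TTS Arena.

```python
import math
from collections import defaultdict


def fit_bradley_terry(matches, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j).
    Returns a dict of strengths that sum to 1. Assumes every model has at
    least one win; a model with zero wins would get strength 0.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # number of comparisons per unordered pair
    for winner, loser in matches:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0

    models = sorted({m for pair in games for m in pair})
    p = {m: 1.0 / len(models) for m in models}

    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (p[i] + p[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            new_p[i] = wins[i] / denom
        total = sum(new_p.values())
        new_p = {m: v / total for m, v in new_p.items()}
        converged = max(abs(new_p[m] - p[m]) for m in models) < tol
        p = new_p
        if converged:
            break
    return p


def to_0_100(strengths):
    """Min-max scale log-strengths to 0-100 (assumed normalization)."""
    ratings = {m: math.log(s) for m, s in strengths.items()}
    lo, hi = min(ratings.values()), max(ratings.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero if all ratings tie
    return {m: (r - lo) / span * 100.0 for m, r in ratings.items()}


# Invented toy votes: each tuple is (preferred clip's model, other model).
votes = (
    [("A", "B")] * 8 + [("B", "A")] * 2
    + [("B", "C")] * 7 + [("C", "B")] * 3
    + [("A", "C")] * 9 + [("C", "A")] * 1
)

scores = to_0_100(fit_bradley_terry(votes))
for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```

Under this scaling the top model always lands at 100 and the bottom model at 0, so a score only describes a model's position relative to the current field and shifts whenever the best or worst entry changes.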
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Eleven v3 | ElevenLabs | Closed | 100.0 |
| 02 | OpenVoice | MyShell AI | Open | 98.3 |
| 03 | MiniMax Speech 2.6 | MiniMax | Closed | 96.3 |
| 04 | Eleven Multilingual v2 | ElevenLabs | Closed | 93.7 |
| 05 | OpenAI TTS-1-HD | OpenAI | Closed | 92.7 |
| 06 | Kokoro v1.0 | hexgrad | Open | 89.5 |
| 07 | Fish Speech v1.5 | Fish Audio | Open | 84.6 |
| 08 | OpenVoice V2 | MyShell AI | Open | 80.2 |
| 09 | XTTS v2 | Coqui | Open | 76.0 |
| 10 | StyleTTS 2 | Columbia University | Open | 73.7 |
| 11 | MetaVoice-1B | MetaVoice | Open | 69.0 |
| 12 | Vokan TTS | ShoukanLabs | Open | 0.0 |
MOS and TTS Arena measure related but different things. MOS is an absolute 1–5 rating averaged across listeners; TTS Arena is a relative ranking based on side-by-side preference. Arena is harder to game and updates faster, but MOS gives a sense of absolute quality.
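The difference between an absolute average and a relative preference can be seen in a small Python sketch. All numbers are invented for illustration, and `model_a` and `model_b` are hypothetical names. Two models can sit close together on MOS while listeners still prefer one of them in most side-by-side comparisons.

```python
from statistics import mean

# Invented 1-5 listener ratings, collected for each model in isolation.
mos_ratings = {
    "model_a": [4, 4, 5, 4, 4],
    "model_b": [4, 4, 4, 4, 4],
}
mos = {model: mean(r) for model, r in mos_ratings.items()}

# Invented side-by-side votes: each entry names the preferred model.
pairwise_votes = ["model_a"] * 9 + ["model_b"] * 1
win_rate_a = pairwise_votes.count("model_a") / len(pairwise_votes)

print(mos)         # MOS gap between the two models is 0.2 points
print(win_rate_a)  # model_a wins 0.9 of the head-to-head comparisons
```

The MOS values show where each model sits on the absolute 1–5 scale, while the win rate only says which of the two listeners prefer.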