Blind head-to-head listening test for text-to-speech models, ranked by Bradley-Terry on pairwise wins.
TTS Arena plays two anonymous audio clips for the same text and asks the listener which sounds better. Naturalness, prosody, voice identity stability, and emotion all feed into the preference. We treat the aggregate Bradley-Terry rating as the most direct proxy currently available for "which TTS model feels human."
Listeners do not see which model produced which clip. Wins and losses feed a Bradley-Terry rating per model. We normalize the published rating to a 0–100 scale so it sits alongside WER and MOS on this page.
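As a rough illustration of the rating step described above, the sketch below fits Bradley-Terry strengths from a list of (winner, loser) votes using the standard iterative minorization-maximization update, then rescales them to 0–100. The model names, the vote counts, and the min-max rescaling of log-strengths are all assumptions made for illustration; the exact normalization applied to the published rating is not specified here.

```python
import math
from collections import defaultdict


def fit_bradley_terry(matches, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the iterative update p_i = W_i / sum_j n_ij / (p_i + p_j),
    where W_i is the win count of model i and n_ij is the number of
    comparisons between i and j. Strengths are rescaled to sum to 1.
    """
    wins = defaultdict(int)
    pair_counts = defaultdict(int)
    models = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0)
                if n:
                    denom += n / (strength[m] + strength[other])
            # Models with no comparisons keep their previous strength.
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


def normalize_0_100(strength):
    """Hypothetical normalization: min-max rescale log-strengths to 0-100.

    Assumes every strength is positive, i.e. every model has at least one win.
    """
    logs = {m: math.log(s) for m, s in strength.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:
        return {m: 50.0 for m in logs}
    return {m: 100.0 * (v - lo) / (hi - lo) for m, v in logs.items()}


# Made-up votes between three hypothetical models.
matches = (
    [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
    + [("model_b", "model_c")] * 6 + [("model_c", "model_b")] * 4
    + [("model_a", "model_c")] * 8 + [("model_c", "model_a")] * 2
)
strengths = fit_bradley_terry(matches)
scores = normalize_0_100(strengths)
```

Under this min-max choice the top model always maps to 100 and the bottom model to 0, so the normalized value only expresses position within the current pool of models, not absolute quality.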
No scores yet for this benchmark.
Not enough scored models yet.
MOS and TTS Arena measure related but different things. MOS is an absolute 1–5 rating averaged across listeners; TTS Arena is a relative ranking based on side-by-side preference. Arena is harder to game and updates faster, but MOS gives a sense of absolute quality.
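To make the absolute-versus-relative distinction concrete, here is a small sketch with made-up numbers. The model names, ratings, and votes are hypothetical and do not come from any real evaluation.

```python
from statistics import mean

# MOS: each listener rates one clip on an absolute 1-5 scale.
mos_ratings = {
    "model_a": [4, 5, 4, 4],
    "model_b": [4, 4, 3, 4],
}
mos = {model: mean(ratings) for model, ratings in mos_ratings.items()}

# Arena-style: each listener hears both clips and picks the one they prefer.
votes = ["model_a", "model_a", "model_b", "model_a"]
win_rate_a = votes.count("model_a") / len(votes)
```

Here the MOS values (4.25 and 3.75) place each model on a fixed scale, while the win rate (0.75) only says that model_a is preferred over model_b, not how good either one is in absolute terms.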
Based on score correlations across our database.