TTS Arena: Human Preference Leaderboard

Creator: Hugging Face TTS-AGI
Published: 2024

Blind head-to-head listening test for text-to-speech models, ranked by Bradley-Terry on pairwise wins.

Models Tested

12

Top Score

100.0

How It Works

TTS Arena plays two anonymous audio clips of the same text and asks the listener which sounds better. Naturalness, prosody, voice identity stability, and emotion all feed into the preference. The aggregate Bradley-Terry rating is one of the most direct proxies available today for which TTS model sounds human.

Listeners do not see which model produced which clip. Wins and losses feed a Bradley-Terry rating per model. We normalize the published rating to a 0–100 scale so it sits alongside WER and MOS on this page.
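
A minimal sketch of how pairwise votes could be turned into a 0–100 score. It assumes the standard Bradley-Terry model fitted with the classic minorization-maximization (Zermelo) update, followed by a min-max rescaling of log-strengths. Neither the fitting procedure nor the normalization formula is stated on this page, so both are assumptions (min-max is at least consistent with the top entry scoring 100.0 and the bottom entry 0.0). The function names and the toy votes are hypothetical.

```python
import math
from collections import defaultdict


def fit_bradley_terry(votes, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the minorization-maximization (Zermelo) update. Assumes every
    model has at least one win and the comparison graph is connected.
    """
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # number of comparisons per unordered pair
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    models = sorted({m for pair in games for m in pair})
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    (other,) = pair - {m}
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom
        # Strengths are only defined up to a common factor; rescale each pass.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def normalize_0_100(strength):
    """Min-max rescale log-strengths: best model -> 100, worst -> 0 (assumed scheme)."""
    rating = {m: math.log(s) for m, s in strength.items()}
    lo, hi = min(rating.values()), max(rating.values())
    return {m: 100.0 * (r - lo) / (hi - lo) for m, r in rating.items()}


# Toy votes, invented for illustration: each pair is compared 10 times.
votes = (
    [("A", "B")] * 7 + [("B", "A")] * 3
    + [("B", "C")] * 6 + [("C", "B")] * 4
    + [("A", "C")] * 8 + [("C", "A")] * 2
)
scores = normalize_0_100(fit_bradley_terry(votes))
for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```

Under a min-max scheme like this one, a score of 0.0 only means "lowest rating in the current set", not that a model lost every comparison.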

Dataset size

Tens of thousands of blind A/B listening comparisons across open and closed TTS systems.

Mean score

79.4

Median score

87.0

Open / Closed

8 / 4

Top Scorers

#	Model	Lab	Source	Score
01	Eleven v3	ElevenLabs	Closed	100.0
02	OpenVoice	MyShell AI	Open	98.1
03	MiniMax Speech 2.6	MiniMax	Closed	95.7
04	Eleven Multilingual v2	ElevenLabs	Closed	93.8
05	OpenAI TTS-1-HD	OpenAI	Closed	92.5
06	Kokoro v1.0	hexgrad	Open	89.4
07	Fish Speech v1.5	Fish Audio	Open	84.6
08	OpenVoice V2	MyShell AI	Open	80.7
09	XTTS v2	Coqui	Open	76.2
10	StyleTTS 2	Columbia University	Open	73.8
11	MetaVoice-1B	MetaVoice	Open	68.3
12	Vokan TTS	ShoukanLabs	Open	0.0

Open vs Closed Source

Gap on TTS Arena: +1.9 pts, closed leads (top closed model, 100.0, vs. top open model, 98.1)

Top Open-Source Models

1	OpenVoice	98.1
2	Kokoro v1.0	89.4
3	Fish Speech v1.5	84.6

Top Closed-Source Models

1	Eleven v3	100.0
2	MiniMax Speech 2.6	95.7
3	Eleven Multilingual v2	93.8

Score vs Parameter Count

10 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

ElevenLabs	96.9	n = 2
MyShell AI	89.4	n = 2
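
The aggregate figures on this page (mean, median, open/closed split, gap, per-lab averages) can be recomputed from the Top Scorers table. The sketch below does that with the standard library only. The variable names are illustrative, and the "at least two models per lab" filter is an assumption inferred from the two labs listed above. The published +1.9 gap matches the difference between the top closed and top open model; the difference between the group means comes out much larger.

```python
from collections import defaultdict
from statistics import mean, median

# (model, lab, source, score) rows copied from the Top Scorers table.
rows = [
    ("Eleven v3", "ElevenLabs", "Closed", 100.0),
    ("OpenVoice", "MyShell AI", "Open", 98.1),
    ("MiniMax Speech 2.6", "MiniMax", "Closed", 95.7),
    ("Eleven Multilingual v2", "ElevenLabs", "Closed", 93.8),
    ("OpenAI TTS-1-HD", "OpenAI", "Closed", 92.5),
    ("Kokoro v1.0", "hexgrad", "Open", 89.4),
    ("Fish Speech v1.5", "Fish Audio", "Open", 84.6),
    ("OpenVoice V2", "MyShell AI", "Open", 80.7),
    ("XTTS v2", "Coqui", "Open", 76.2),
    ("StyleTTS 2", "Columbia University", "Open", 73.8),
    ("MetaVoice-1B", "MetaVoice", "Open", 68.3),
    ("Vokan TTS", "ShoukanLabs", "Open", 0.0),
]

scores = [row[3] for row in rows]
by_source = defaultdict(list)
by_lab = defaultdict(list)
for _, lab, source, score in rows:
    by_source[source].append(score)
    by_lab[lab].append(score)

mean_score = round(mean(scores), 1)      # 79.4
median_score = round(median(scores), 1)  # 87.0
open_closed = (len(by_source["Open"]), len(by_source["Closed"]))  # (8, 4)

# Gap between the best closed and best open model.
top_gap = round(max(by_source["Closed"]) - max(by_source["Open"]), 1)    # 1.9
# Gap between group means, for comparison.
mean_gap = round(mean(by_source["Closed"]) - mean(by_source["Open"]), 1)  # 24.1

# Per-lab averages, keeping only labs with at least two scored models.
lab_averages = {
    lab: round(mean(vals), 1) for lab, vals in by_lab.items() if len(vals) >= 2
}  # {'ElevenLabs': 96.9, 'MyShell AI': 89.4}

print(mean_score, median_score, open_closed, top_gap, mean_gap, lab_averages)
```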

Most Correlated Benchmarks

Not enough scored models yet.

What It Captures Well

Captures naturalness, which no automatic metric measures reliably.
Makes it easy to compare TTS systems that target different voice styles.
Updated continuously as new releases land.

Where It Falls Short

Preference is subjective and shifts with listener expectations.
New entries take time to accumulate enough votes for a stable rating.
Does not test ASR (speech-to-text) ability; use WER for that.

Frequently Asked Questions

Is TTS Arena better than MOS?

They measure related but different things. MOS is an absolute 1–5 rating averaged across listeners; TTS Arena is a relative ranking based on side-by-side preference. Arena is harder to game and updates faster, but MOS gives a sense of absolute quality.
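
To make the distinction concrete, here is a small sketch that computes both kinds of number. All data in it are invented for illustration, and the model names are placeholders. MOS averages each model's own 1–5 ratings, while an Arena-style figure only exists for a pair of models heard side by side.

```python
from statistics import mean

# Hypothetical listener data, invented for illustration only.
mos_ratings = {
    "model_a": [4, 5, 4, 4, 4],  # absolute 1-5 ratings, one per listener
    "model_b": [4, 4, 3, 4, 3],
}
# Each entry is the clip a listener preferred in a blind A/B trial.
ab_votes = ["model_a", "model_a", "model_b", "model_a", "model_a"]

# MOS: an absolute score per model, computed independently of other models.
mos = {model: mean(ratings) for model, ratings in mos_ratings.items()}

# Pairwise preference: a relative score that only makes sense for the pair.
win_rate_a = ab_votes.count("model_a") / len(ab_votes)

print(mos)         # per-model mean opinion scores
print(win_rate_a)  # share of A/B trials won by model_a
```

A pairwise win rate says nothing about whether either clip was good in absolute terms, which is the gap MOS fills.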

Related Benchmarks

Based on score correlations across our database.

MOS	Pearson r: n/a	n = 7