MOS: Mean Opinion Score

Name: MOS: Mean Opinion Score
Creator: ITU-T standard, used widely in TTS research
Published: 1996
Keywords: MOS, AI benchmark, audio model evaluation, ITU-T standard, used widely in TTS research

Classic 1–5 listener rating for TTS naturalness. The longest-running quality metric in speech.

Open Dataset

Models Tested

Top Score: 5.0 / 5
Published: 1996
Source: ITU-T standard, used widely in TTS research

How It Works

MOS is the oldest TTS quality metric still in use. Listeners hear a clip and rate it on a scale from 1 (bad) to 5 (excellent). The Mean Opinion Score is the average of those ratings. It is an absolute quality reading (4.5 is "very good", 3 is "fair"), which makes it complementary to relative, preference-based rankings such as TTS Arena.

Each clip is rated by 20–100 listeners. The ratings are averaged and reported with confidence intervals. The metric is most useful when many systems are scored under the same protocol; raw MOS numbers from different papers are not always directly comparable.
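The averaging step can be sketched as follows. This is a minimal illustration, not a prescribed protocol: the function name `mos_with_ci`, the sample ratings, and the choice of a normal-approximation 95% interval (z = 1.96) are all assumptions made for the example.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of 1-5 listener ratings.

    Assumes a normal-approximation interval; z = 1.96 corresponds to
    roughly 95% coverage. Small panels may warrant a t-based interval.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mean = statistics.fmean(ratings)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * std_err


# Hypothetical ratings from 20 listeners for a single clip.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 5, 4, 3, 5, 4, 4, 5, 4, 4, 3, 5]
mos, half_width = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

With only 20 raters the interval is wide relative to the differences between top systems, which is one reason larger panels are preferred.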

Dataset size

Reported by model authors and benchmark studies. Each MOS score averages 20–100 listener ratings per clip.

Mean score: 4.3 / 5
Median score: 4.2 / 5
Open / Closed: 7 / 2

Top Scorers

#	Model	Lab	Source	Score
01	CosyVoice 2.0	Alibaba	Open	5.0 / 5
02	Eleven Multilingual v2	ElevenLabs	Closed	4.5 / 5
03	Eleven v3	ElevenLabs	Closed	4.5 / 5
04	XTTS v2	Coqui	Open	4.2 / 5
05	Kokoro v1.0	hexgrad	Open	4.2 / 5
06	StyleTTS 2	Columbia University	Open	4.2 / 5
07	OpenVoice	MyShell AI	Open	4.1 / 5
08	WhisperSpeech	Collabora	Open	3.9 / 5
09	Vokan TTS	ShoukanLabs	Open	3.8 / 5
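The summary figures reported for this benchmark (mean 4.3, median 4.2, 7 open and 2 closed models) can be cross-checked against the table above. The sketch below copies the nine rows by hand; the variable names are illustrative, and reading the open/closed gap as "best open minus best closed" is an assumption, since the page does not state how the gap is defined.

```python
import statistics

# (model, source, score) rows copied from the Top Scorers table.
rows = [
    ("CosyVoice 2.0", "Open", 5.0),
    ("Eleven Multilingual v2", "Closed", 4.5),
    ("Eleven v3", "Closed", 4.5),
    ("XTTS v2", "Open", 4.2),
    ("Kokoro v1.0", "Open", 4.2),
    ("StyleTTS 2", "Open", 4.2),
    ("OpenVoice", "Open", 4.1),
    ("WhisperSpeech", "Open", 3.9),
    ("Vokan TTS", "Open", 3.8),
]

scores = [score for _, _, score in rows]
mean_score = statistics.fmean(scores)
median_score = statistics.median(scores)

n_open = sum(1 for _, source, _ in rows if source == "Open")
n_closed = len(rows) - n_open

# Assumed gap definition: best open score minus best closed score.
best_open = max(score for _, source, score in rows if source == "Open")
best_closed = max(score for _, source, score in rows if source == "Closed")

print(f"mean={mean_score:.1f} median={median_score:.1f}")
print(f"open/closed={n_open}/{n_closed} gap={best_open - best_closed:.1f}")
```

Under that reading the gap is 0.5 points in favour of open models. Averaging each group instead gives a different picture (open about 4.2, closed 4.5), so the definition matters when quoting the number.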

Score Distribution

Open vs Closed Source

Gap on MOS: 0.5 pts, open leads

Top Open-Source Models

1. CosyVoice 2.0: 5.0 / 5
2. XTTS v2: 4.2 / 5
3. Kokoro v1.0: 4.2 / 5

Top Closed-Source Models

1. Eleven Multilingual v2: 4.5 / 5
2. Eleven v3: 4.5 / 5

Score vs Parameter Count

7 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

ElevenLabs: 4.5 / 5 (n = 2)

Most Correlated Benchmarks

Not enough scored models yet.

What It Captures Well

Absolute quality scale that translates to non-technical stakeholders.
Long-running, so historical comparisons are possible.
Naturally averages across listener variance.

Where It Falls Short

Protocol-dependent: different listener pools, scales, and instructions can shift MOS by 0.3–0.5 points.
Saturates near 4.5: top systems cluster in a narrow band.
Cannot compare across studies without careful normalization.
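One common way to approach the normalization mentioned above is to convert each study's scores to z-scores within that study, so that a uniformly harsher or more lenient listener pool cancels out. The sketch below is only an illustration of that idea, not the method used for this leaderboard; the function name `z_normalize`, the system names, and both sets of scores are hypothetical.

```python
import statistics


def z_normalize(study_scores):
    """Map {system: MOS} from one study to {system: z-score} within it."""
    values = list(study_scores.values())
    if len(values) < 2:
        raise ValueError("need at least two systems per study")
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    if std == 0:
        raise ValueError("all systems scored identically; z-scores undefined")
    return {name: (score - mean) / std for name, score in study_scores.items()}


# Hypothetical studies rating the same three systems. Study B's listeners
# score everything 0.4 points lower than study A's.
study_a = {"system_x": 4.4, "system_y": 4.0, "system_z": 3.6}
study_b = {"system_x": 4.0, "system_y": 3.6, "system_z": 3.2}

norm_a = z_normalize(study_a)
norm_b = z_normalize(study_b)
for name in study_a:
    print(f"{name}: A={norm_a[name]:+.2f}  B={norm_b[name]:+.2f}")
```

After normalization the two studies agree even though their raw MOS values differ. The caveat is that z-scores are relative to the set of systems in each study, so this only helps when the studies share most of their systems.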

Frequently Asked Questions

What MOS score sounds human?

Natural human speech tends to score 4.3–4.6. Top neural TTS systems in 2026 are in the same range on read speech. Below 4.0, most listeners can clearly tell that the speech is synthetic.

Related Benchmarks

Based on score correlations across our database.

TTS Arena: Pearson r n/a (n = 7)
WER: Pearson r n/a (n = 2)

