A harder, tougher-to-game replacement for the original MMLU, covering reasoning across 14 academic and professional subjects.
MMLU-PRO is a re-engineered version of the older MMLU benchmark. It keeps the broad coverage — math, law, engineering, medicine, business, philosophy — but raises the difficulty floor and replaces the easy four-choice format with ten choices per question. The bigger answer space cuts random-guess scores from 25% down to 10%, so the gap between strong and weak models is much more visible.
Models are evaluated zero-shot or with chain-of-thought. Each question has one correct answer and nine distractors. Scoring is percent correct across the full 12K set, with per-subject breakdowns. The dataset deliberately removes the noisiest, most-memorized questions from the original MMLU and adds harder reasoning items pulled from textbooks and exams.
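For intuition, here is a minimal scoring sketch in Python: overall and per-subject accuracy over a list of question records. The field names (`question_id`, `category`, `answer_index`) and the `predictions` mapping are illustrative assumptions for this sketch, not the official MMLU-PRO schema or evaluation harness.

```python
from collections import defaultdict

def score_mmlu_pro(records, predictions):
    """Compute overall and per-subject accuracy (percent correct).

    `records`: list of dicts with illustrative fields `question_id`,
    `category` (subject), and `answer_index` (gold option index).
    `predictions`: dict mapping question_id -> predicted option index.
    These field names are assumptions, not the official schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        subject = rec["category"]
        total[subject] += 1
        if predictions.get(rec["question_id"]) == rec["answer_index"]:
            correct[subject] += 1
    per_subject = {s: 100.0 * correct[s] / total[s] for s in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return overall, per_subject

if __name__ == "__main__":
    # Two made-up records; with ten options a uniform random guesser
    # expects ~10%, versus ~25% on the four-choice original MMLU.
    records = [
        {"question_id": "q1", "category": "math", "answer_index": 3},
        {"question_id": "q2", "category": "law", "answer_index": 7},
    ]
    predictions = {"q1": 3, "q2": 0}
    overall, per_subject = score_mmlu_pro(records, predictions)
    print(f"overall: {overall:.1f}%", per_subject)
    # overall: 50.0% {'math': 100.0, 'law': 0.0}
```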
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | Gemini 3 Flash | Google | Closed | 88.6 |
| 02 | Qwen3.5-397B-A17B | Alibaba | Open | 87.8 |
| 03 | DeepSeek-V4-Pro | DeepSeek | Open | 87.5 |
| 04 | Kimi K2.5 | Moonshot AI | Open | 87.1 |
| 05 | DeepSeek-V4-Flash | DeepSeek | Open | 86.4 |
| 06 | Qwen3.6-27B | Alibaba | Open | 86.2 |
| 07 | Qwen3.6 35B-A3B | Alibaba | Open | 85.2 |
| 08 | Gemma 4 31B IT | Google | Open | 85.2 |
| 09 | DeepSeek-V3.2 | DeepSeek | Open | 85.0 |
| 10 | DeepSeek-R1 | DeepSeek | Open | 84.0 |
| 11 | Nvidia Nemotron 3 Super | NVIDIA | Open | 83.7 |
| 12 | Gemma 4 26B-A4B IT | Google | Open | 82.6 |
| 13 | Qwen3.5-9B | Alibaba | Open | 82.5 |
| 14 | GPT-5.2 | OpenAI | Closed | 80.0 |
| 15 | Claude Opus 4.6 | Anthropic | Closed | 78.5 |
Four models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.
MMLU-PRO has harder questions, ten answer choices instead of four, and a curated set that removes weak items from the original. Top models drop 10–20 points compared to MMLU because there is less room to guess and less room to coast on memorization.
Strong open-weight models in the 30B–70B range typically score 60–75%. The very best frontier models in 2026 are above 85%. Anything under 50% is well behind the field on general knowledge tasks.
MMLU-PRO does not directly measure coding ability; it measures broad reasoning and recall. For coding, look at SWE-Verified and SWE-Pro. The two correlate, but specialized code models can score modestly on MMLU-PRO and very well on the SWE family.
MMLU-PRO is harder to game than the original, but not impossible. Some training datasets include MMLU-PRO derivatives, which inflates scores. Pair it with GPQA and HLE to spot models that look strong only because the test leaked.
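One rough way to operationalize that pairing: flag models whose MMLU-PRO score far outruns their scores on the companion benchmarks. The gap threshold and the score dictionaries below are illustrative assumptions, not part of any official methodology.

```python
def flag_suspect_scores(mmlu_pro, companions, gap_threshold=25.0):
    """Flag models whose MMLU-PRO score outruns their other benchmark scores.

    `mmlu_pro` and each dict in `companions` map model name -> score (percent).
    `gap_threshold` is an arbitrary illustrative cutoff: a model that beats its
    best companion score by more than this many points gets flagged for a
    closer contamination check.
    """
    flagged = []
    for model, score in mmlu_pro.items():
        others = [bench[model] for bench in companions.values() if model in bench]
        if others and score - max(others) > gap_threshold:
            flagged.append(model)
    return flagged

# Hypothetical scores, for illustration only.
mmlu_pro = {"model-a": 86.0, "model-b": 84.0}
gpqa = {"model-a": 72.0, "model-b": 41.0}
hle = {"model-a": 18.0, "model-b": 6.0}
print(flag_suspect_scores(mmlu_pro, {"GPQA": gpqa, "HLE": hle}))
# prints ['model-b']: strong on MMLU-PRO but weak everywhere else.
```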