A harder, tougher-to-game replacement for the original MMLU, covering reasoning across 14 academic and professional subjects.
MMLU-PRO is a re-engineered version of the older MMLU benchmark. It keeps the broad coverage — math, law, engineering, medicine, business, philosophy — but raises the difficulty floor and replaces the easy four-choice format with ten choices per question. The bigger answer space cuts random-guess scores from 25% down to 10%, so the gap between strong and weak models is much more visible.
Models are evaluated zero-shot or with chain-of-thought. Each question has one correct answer and nine distractors. Scoring is percent correct across the full 12K set, with per-subject breakdowns. The dataset deliberately removes the noisiest, most-memorized questions from the original MMLU and adds harder reasoning items pulled from textbooks and exams.
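For intuition, here is a minimal scoring sketch in Python: overall and per-subject accuracy over a list of question records. The field names (`question_id`, `category`, `answer_index`) and the `predictions` mapping are illustrative assumptions for this sketch, not the official MMLU-PRO schema or evaluation harness.

```python
from collections import defaultdict

def score_mmlu_pro(records, predictions):
    """Compute overall and per-subject accuracy (percent correct).

    `records`: list of dicts with illustrative fields `question_id`,
    `category` (subject), and `answer_index` (gold option index).
    `predictions`: dict mapping question_id -> predicted option index.
    These field names are assumptions, not the official schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        subject = rec["category"]
        total[subject] += 1
        if predictions.get(rec["question_id"]) == rec["answer_index"]:
            correct[subject] += 1
    per_subject = {s: 100.0 * correct[s] / total[s] for s in total}
    overall = 100.0 * sum(correct.values()) / sum(total.values())
    return overall, per_subject

if __name__ == "__main__":
    # Two made-up records; with ten options a uniform random guesser
    # expects ~10%, versus ~25% on the four-choice original MMLU.
    records = [
        {"question_id": "q1", "category": "math", "answer_index": 3},
        {"question_id": "q2", "category": "law", "answer_index": 7},
    ]
    predictions = {"q1": 3, "q2": 0}
    overall, per_subject = score_mmlu_pro(records, predictions)
    print(f"overall: {overall:.1f}%", per_subject)
    # overall: 50.0% {'math': 100.0, 'law': 0.0}
```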
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | Gemini 3 Flash | Google | Closed | 88.6 |
| 02 | Qwen3.5-397B-A17B | Alibaba | Open | 87.8 |
| 03 | DeepSeek-V4-Pro | DeepSeek | Open | 87.5 |
| 04 | Kimi K2.5 | Moonshot AI | Open | 87.1 |
| 05 | DeepSeek-V4-Flash | DeepSeek | Open | 86.4 |
| 06 | Qwen3.6-27B | Alibaba | Open | 86.2 |
| 07 | Qwen3.6 35B-A3B | Alibaba | Open | 85.2 |
| 08 | Gemma 4 31B IT | Google | Open | 85.2 |
| 09 | DeepSeek-V3.2 | DeepSeek | Open | 85.0 |
| 10 | DeepSeek-R1 | DeepSeek | Open | 84.0 |
| 11 | Nvidia Nemotron 3 Super | NVIDIA | Open | 83.7 |
| 12 | Gemma 4 26B-A4B IT | Google | Open | 82.6 |
| 13 | Qwen3.5-9B | Alibaba | Open | 82.5 |
| 14 | GPT-5.2 | OpenAI | Closed | 80.0 |
| 15 | Claude Opus 4.6 | Anthropic | Closed | 78.5 |
Four models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.
MMLU-PRO has harder questions, ten answer choices instead of four, and a curated set that removes weak items from the original. Top models drop 10–20 points compared to MMLU because there is less room to guess and less room to coast on memorization.
Strong open-weight models in the 30B–70B range typically score 60–75%. The very best frontier models in 2026 are above 85%. Anything under 50% is well behind the field on general knowledge tasks.
MMLU-PRO does not directly measure coding ability; it measures broad reasoning and recall. For coding, look at SWE-Verified and SWE-Pro. The two correlate, but specialized code models can score modestly on MMLU-PRO and very well on the SWE family.
MMLU-PRO is harder to game than the original, but not impossible. Some training datasets include MMLU-PRO derivatives, which inflates scores. Pair it with GPQA and HLE to spot models that look strong only because the test leaked.
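One rough way to operationalize that pairing: flag models whose MMLU-PRO score far outruns their scores on the companion benchmarks. The gap threshold and the score dictionaries below are illustrative assumptions, not part of any official methodology.

```python
def flag_suspect_scores(mmlu_pro, companions, gap_threshold=25.0):
    """Flag models whose MMLU-PRO score outruns their other benchmark scores.

    `mmlu_pro` and each dict in `companions` map model name -> score (percent).
    `gap_threshold` is an arbitrary illustrative cutoff: a model that beats its
    best companion score by more than this many points gets flagged for a
    closer contamination check.
    """
    flagged = []
    for model, score in mmlu_pro.items():
        others = [bench[model] for bench in companions.values() if model in bench]
        if others and score - max(others) > gap_threshold:
            flagged.append(model)
    return flagged

# Hypothetical scores, for illustration only.
mmlu_pro = {"model-a": 86.0, "model-b": 84.0}
gpqa = {"model-a": 72.0, "model-b": 41.0}
hle = {"model-a": 18.0, "model-b": 6.0}
print(flag_suspect_scores(mmlu_pro, {"GPQA": gpqa, "HLE": hle}))
# prints ['model-b']: strong on MMLU-PRO but weak everywhere else.
```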