Five hundred curated competition math problems used as a fast, repeatable test of mathematical reasoning.
MATH-500 is the curated 500-problem subset most labs report on when measuring step-by-step mathematical reasoning. The problems come from US high-school mathematics competitions such as the AMC 10, AMC 12, and AIME, covering algebra, geometry, number theory, and combinatorics. Solutions require multi-step reasoning, not memorization.
Each problem has a single short-form answer. The model produces a final expression or number, scored by exact match against the reference. Most leaderboards report pass@1 with chain-of-thought, sometimes with majority voting across multiple samples.
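The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not any leaderboard's actual harness: the function names and the normalization rules (trimming whitespace, a trailing period, and surrounding `$` signs) are assumptions made for the example, and real graders typically apply more careful answer-equivalence checks.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Hypothetical light normalization applied before exact-match comparison."""
    text = answer.strip().rstrip(".")   # drop outer whitespace and a trailing period
    text = text.strip("$")              # drop surrounding LaTeX math delimiters
    return text.replace(" ", "")        # ignore internal spacing


def is_correct(prediction: str, reference: str) -> bool:
    """Exact match on the normalized final answer."""
    return normalize(prediction) == normalize(reference)


def majority_vote(samples: list[str]) -> str:
    """Return the most frequent normalized answer among several samples."""
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][0]


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Percentage of problems whose single prediction matches the reference."""
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```

For majority voting, one would call `majority_vote` on the sampled answers for each problem and then score the voted answer with `is_correct`, so the same exact-match rule applies in both settings.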
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | DeepSeek-R1 | DeepSeek | Open | 98.3 |
| 02 | Gemini 2.5 Pro | Google | Closed | 96.7 |
| 03 | Gemini 2.5 Flash | Google | Closed | 93.2 |
| 04 | Gemini 2.0 Flash | Google | Closed | 93.0 |
| 05 | GPT-4.1 Mini | OpenAI | Closed | 92.5 |
| 06 | GPT-4.1 | OpenAI | Closed | 91.3 |
| 07 | Llama 4 Maverick | Meta | Open | 88.9 |
| 08 | DeepSeek-V3 | DeepSeek | Open | 88.7 |
| 09 | Gemini 2.0 Flash Lite Preview | Google | Closed | 87.3 |
| 10 | Llama 4 Scout | Meta | Open | 84.4 |
| 11 | Qwen2.5 Max | Alibaba | Closed | 83.5 |
| 12 | GPT-4o | OpenAI | Closed | 75.9 |
| 13 | Mistral Large 2 | Mistral AI | Open | 73.6 |
| 14 | Mixtral 8x22B Instruct | Mistral AI | Open | 54.5 |
| 15 | Mixtral 8x7B Instruct | Mistral AI | Open | 29.9 |

MATH-500 is not the same as MATH. MATH is the full 12,500-problem benchmark; MATH-500 is a curated subset of 500 problems chosen to be representative while staying fast to evaluate.