Five hundred curated competition math problems used as a fast, repeatable test of mathematical reasoning.
MATH-500 is the curated 500-problem subset of the MATH benchmark that most labs report when measuring step-by-step mathematical reasoning. The problems come from US high-school mathematics competitions and cover algebra, geometry, number theory, and combinatorics. Solutions require multi-step reasoning, not memorization.
Each problem has a single short-form answer. The model produces a final expression or number, scored by exact match against the reference. Most leaderboards report pass@1 with chain-of-thought, sometimes with majority voting across multiple samples.
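The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the grader any particular leaderboard uses: the function names (`normalize`, `exact_match`, `pass_at_1`, `majority_vote`) and the normalization rules are assumptions made for the example, and real harnesses typically apply more elaborate answer normalization or equivalence checks.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Crude normalization (an assumption for this sketch): trim whitespace,
    drop surrounding $...$, remove inner spaces and a trailing period."""
    text = answer.strip()
    if len(text) >= 2 and text.startswith("$") and text.endswith("$"):
        text = text[1:-1]
    return text.replace(" ", "").rstrip(".")


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized final answer equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Percentage of problems whose single sampled answer matches the reference."""
    correct = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def majority_vote(samples: list[str]) -> str:
    """Most common normalized answer across samples; ties go to the answer seen first."""
    return Counter(normalize(s) for s in samples).most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical toy data, not taken from MATH-500.
    references = ["42", "\\frac{1}{2}", "7"]
    predictions = [" 42 ", "$\\frac{1}{2}$", "8"]
    print(f"pass@1: {pass_at_1(predictions, references):.1f}")
    print("majority answer:", majority_vote(["8", "7", " 7 ", "$7$"]))
```

With majority voting, the voted answer for each problem is passed to `exact_match` in place of a single sample, which is why voted scores are not directly comparable to pass@1.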
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | GPT-5 High | OpenAI | Closed | 99.4 |
| 02 | Grok 3 Mini High | xAI | Closed | 99.2 |
| 03 | OpenAI o3 | OpenAI | Closed | 99.2 |
| 04 | Claude Sonnet 4 (Thinking 32K) | Anthropic | Closed | 99.1 |
| 05 | Grok 4 (0709) | xAI | Closed | 99.0 |
| 06 | OpenAI o4-mini | OpenAI | Closed | 98.9 |
| 07 | OpenAI o3-mini High | OpenAI | Closed | 98.5 |
| 08 | DeepSeek-R1 | DeepSeek | Open | 98.3 |
| 09 | Claude Opus 4 (Thinking 16K) | Anthropic | Closed | 98.2 |
| 10 | GLM-4.5 | Z.ai | Open | 97.9 |
| 11 | OpenAI o3-mini | OpenAI | Closed | 97.3 |
| 12 | Kimi K2 Instruct | Moonshot AI | Open | 97.1 |
| 13 | OpenAI o1 | OpenAI | Closed | 97.0 |
| 14 | Gemini 2.5 Flash Lite Preview 06-17 (Thinking) | Google | Closed | 96.9 |
| 15 | Gemini 2.5 Pro | Google | Closed | 96.7 |
28 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.
MATH-500 is not the same as MATH. MATH is the full 12,500-problem benchmark; MATH-500 is a curated subset of 500 problems chosen to be representative while staying fast to evaluate.