Benchmarks · 2021

MATH-500: MATH-500 Competition Math Subset

Creator: OpenAI (subset of the MATH dataset by Hendrycks et al.)
Published: 2021

Five hundred curated competition math problems used as a fast, repeatable test of mathematical reasoning.

Top Score: 99.4
Published: 2021
Source: OpenAI (subset of the MATH dataset by Hendrycks et al.)

How It Works

MATH-500 is a curated 500-problem subset of the MATH benchmark, and it is the version most labs report when measuring step-by-step mathematical reasoning. The problems come from US high-school competitions such as AMC 10, AMC 12, and AIME, covering algebra, geometry, number theory, and combinatorics. Solving them takes multi-step reasoning rather than recall.

Each problem has a single short-form answer. The model produces a final expression or number, which is scored by exact match against the reference. Most leaderboards report pass@1 with chain-of-thought prompting; some also report majority voting across multiple samples.
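The scoring described above can be sketched in a few lines. This is a minimal illustration, not the harness any particular leaderboard uses: the `normalize` rules, the function names, and the sample answers are all assumptions made for the example.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Light cleanup (an assumption): trim, drop a trailing period and spaces."""
    return answer.strip().rstrip(".").replace(" ", "")


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def pass_at_1(predictions: list[str], references: list[str]) -> float:
    """Percentage of problems whose single sampled answer matches the reference."""
    correct = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


def majority_vote(samples: list[str]) -> str:
    """Most common normalized answer among several samples for one problem."""
    counts = Counter(normalize(s) for s in samples)
    return counts.most_common(1)[0][0]


# Hypothetical answers, not taken from the dataset.
references = ["42", "\\frac{1}{2}", "x^2+1"]
predictions = ["42", "\\frac{1}{2}", "x^2-1"]
score = pass_at_1(predictions, references)  # two of three answers match

# Four samples for one problem; three of them normalize to the same string.
voted = majority_vote(["7", "7 ", "8", "7."])
```

With majority voting, the voted answer for each problem would be passed to `exact_match` in place of a single sample.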

Dataset size: 500 problems sampled from the MATH benchmark, balanced across topic and difficulty.
Mean score: 80.5
Median score: 89.3
Open / Closed: 25 / 28

Top Scorers

#	Model	Lab	Source	Score
01	GPT-5 High	OpenAI	Closed	99.4
02	Grok 3 Mini High	xAI	Closed	99.2
03	OpenAI o3	OpenAI	Closed	99.2
04	Claude Sonnet 4 (Thinking 32K)	Anthropic	Closed	99.1
05	Grok 4 (0709)	xAI	Closed	99.0
06	OpenAI o4-mini	OpenAI	Closed	98.9
07	OpenAI o3-mini High	OpenAI	Closed	98.5
08	DeepSeek-R1	DeepSeek	Open	98.3
09	Claude Opus 4 (Thinking 16K)	Anthropic	Closed	98.2
10	GLM-4.5	Z.ai	Open	97.9
11	OpenAI o3-mini	OpenAI	Closed	97.3
12	Kimi K2 Instruct	Moonshot AI	Open	97.1
13	OpenAI o1	OpenAI	Closed	97.0
14	Gemini 2.5 Flash Lite Preview 06-17 (Thinking)	Google	Closed	96.9
15	Gemini 2.5 Pro	Google	Closed	96.7

Open vs Closed Source

Gap on MATH-500: +1.1 pts (closed leads)

Top Open-Source Models

1. DeepSeek-R1: 98.3
2. GLM-4.5: 97.9
3. Kimi K2 Instruct: 97.1

Top Closed-Source Models

1. GPT-5 High: 99.4
2. Grok 3 Mini High: 99.2
3. OpenAI o3: 99.2
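The +1.1 point gap is consistent with the difference between the best closed-source and best open-source scores listed here. Whether the page actually computes it that way is an assumption; the sketch below only checks the arithmetic against the listed scores.

```python
# Scores copied from the two top-3 lists in this section.
top_scores = {
    "Closed": {"GPT-5 High": 99.4, "Grok 3 Mini High": 99.2, "OpenAI o3": 99.2},
    "Open": {"DeepSeek-R1": 98.3, "GLM-4.5": 97.9, "Kimi K2 Instruct": 97.1},
}

best_closed = max(top_scores["Closed"].values())
best_open = max(top_scores["Open"].values())

# Round to one decimal to hide floating-point noise in the subtraction.
gap = round(best_closed - best_open, 1)
```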

Score vs Parameter Count

28 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average score	n
Anthropic	95.9	5
OpenAI	93.8	12
DeepSeek	93.5	2
xAI	90.8	4
Google	90.3	7
Alibaba	86.7	4
Meta	55.2	11
Mistral AI	48.3	5

Most Correlated Benchmarks

Benchmark	Pearson r	n
MMLU-PRO	+0.95	51
Arena Score	+0.93	53
GPQA	+0.88	52
SciCode	+0.86	51
LiveCodeBench	+0.79	52
AA Intelligence Index	+0.71	53
AA LCR	+0.68	41
IFBench	+0.64	41
Terminal Bench Hard	+0.60	38
HLE	+0.40	52
Pearson r ranges from −1 to +1. A positive value means models that score higher on one benchmark tend to score higher on the other; a negative value means the opposite.
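For readers who want to reproduce a figure like these, Pearson r can be computed from two lists of per-model scores. The sketch below uses the standard formula; the function name and the five score pairs are hypothetical and are not drawn from the page's database.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Co-variation of the two lists around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five models on two benchmarks (not real data).
math500 = [55.0, 70.0, 85.0, 95.0, 99.0]
other = [30.0, 45.0, 60.0, 72.0, 80.0]
r = pearson_r(math500, other)  # close to +1: the two lists rise together
```

Only models scored on both benchmarks can enter the calculation, which is presumably why n differs from row to row.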

What It Captures Well

Fast and cheap to run, with stable scoring.
Covers a broad range of competition math topics.
Long enough to give a stable signal across model generations.
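How stable a 500-problem score is can be estimated with a binomial standard error. This is a rough sketch that assumes problems are independent and each is answered once; the function name is hypothetical, and real evaluations may average several samples per problem.

```python
import math


def standard_error_pts(score_pct: float, n_problems: int = 500) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)


# With 500 problems, each problem is worth 0.2 percentage points.
points_per_problem = 100.0 / 500

se_at_mean = standard_error_pts(80.5)  # near the mean score reported on this page
se_at_top = standard_error_pts(99.4)   # near the top score reported on this page
```

Under these assumptions, differences of a few tenths of a point near the top of the table correspond to one or two problems and should be read with caution.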

Where It Falls Short

The full MATH set is partially memorized by some models, which inflates scores.
Short-answer scoring penalizes correct reasoning that arrives at the wrong final form.
Older than AIME 2025, and therefore less resistant to contamination.
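The answer-form problem is easy to see with a small example. The sketch below compares a strict string match with a lenient numeric comparison; both functions are illustrative assumptions, and production graders typically handle far more answer forms than plain fractions and decimals.

```python
from fractions import Fraction
from typing import Optional


def strict_match(prediction: str, reference: str) -> bool:
    """Plain string comparison after trimming whitespace."""
    return prediction.strip() == reference.strip()


def to_fraction(text: str) -> Optional[Fraction]:
    """Parse forms like '0.5', '1/2' or '\\frac{1}{2}'; return None otherwise."""
    cleaned = text.strip().replace(" ", "")
    if cleaned.startswith("\\frac{") and cleaned.endswith("}"):
        # Turn \frac{a}{b} into a/b so Fraction can parse it.
        cleaned = cleaned[len("\\frac{"):-1].replace("}{", "/")
    try:
        return Fraction(cleaned)
    except (ValueError, ZeroDivisionError):
        return None


def numeric_match(prediction: str, reference: str) -> bool:
    """Compare by value; fall back to strict matching for non-numeric answers."""
    a, b = to_fraction(prediction), to_fraction(reference)
    if a is None or b is None:
        return strict_match(prediction, reference)
    return a == b


# Same value, different final form.
strict = strict_match("0.5", "\\frac{1}{2}")    # rejected by string comparison
lenient = numeric_match("0.5", "\\frac{1}{2}")  # accepted: both parse to 1/2
```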

Frequently Asked Questions

Is MATH-500 the same as MATH?

No. MATH is the full 12,500-problem benchmark. MATH-500 is a curated subset of 500 problems chosen to be representative while staying fast to evaluate.

Related Benchmarks

Based on score correlations across our database.

MMLU-PRO: Pearson r +0.95


