AIME 2026: American Invitational Mathematics Examination 2026

Name: AIME 2026: American Invitational Mathematics Examination 2026
Creator: Mathematical Association of America, via MathArena
Published: 2026
Keywords: AIME 2026, AI benchmark, text model evaluation, Mathematical Association of America, via MathArena

Fifteen elite high-school competition math problems used as a yearly stress test for chain-of-thought reasoning.

Models Tested: 10
Top Score: 96.4
Published: 2026
Source: Mathematical Association of America, via MathArena

How It Works

The AIME is the qualifier for the US national math olympiad. Each problem demands creative algebra, geometry, number theory, or combinatorics, and the answer is always an integer from 0 to 999 inclusive. Because the contest is written fresh each year, a new AIME set is one of the cleanest tests of genuine mathematical reasoning: a model whose training cutoff predates the contest cannot have seen the answers in training.

Models are asked to solve each problem and produce a single integer answer. The score is the percentage of problems answered with the correct integer. Most leaderboards draw multiple samples per problem and report the majority-vote answer (self-consistency, often written maj@k), which is not the same as single-sample pass@1.
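The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not MathArena's actual harness: the function names, the tie-breaking rule, and the sample data are all assumptions made for the example.

```python
from collections import Counter

def is_valid_aime_answer(value):
    """AIME answers are integers from 0 to 999 inclusive."""
    return isinstance(value, int) and 0 <= value <= 999

def majority_answer(samples):
    """Most common answer among one problem's samples.

    Ties go to the answer seen first (an assumption; real harnesses may differ).
    """
    return Counter(samples).most_common(1)[0][0]

def score_percent(samples_per_problem, reference_answers):
    """Percent of problems whose majority-vote answer matches the reference."""
    correct = 0
    for samples, reference in zip(samples_per_problem, reference_answers):
        # Drop out-of-range outputs before voting.
        valid = [s for s in samples if is_valid_aime_answer(s)]
        if valid and majority_answer(valid) == reference:
            correct += 1
    return 100.0 * correct / len(reference_answers)

# Hypothetical data: 3 problems, 4 samples each (not real AIME answers).
references = [42, 7, 900]
samples = [
    [42, 42, 13, 42],     # majority 42 -> correct
    [7, 8, 8, 8],         # majority 8 -> wrong
    [900, 900, 1000, 5],  # 1000 is out of range and dropped; majority 900 -> correct
]
print(round(score_percent(samples, references), 1))  # 66.7
```

Filtering invalid answers before the vote is one possible design choice; a stricter harness might count any malformed sample as a wrong vote instead.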

Dataset size: 15 problems from the 2026 AIME I and II contests, each with an integer answer from 0 to 999.
Mean score: 94.0
Median score: 94.7
Open / Closed: 10 / 0

Top Scorers

#	Model	Lab	Source	Score
01	Kimi K2.6	Moonshot AI	Open	96.4
02	Ring-2.6-1T	inclusionAI	Open	95.8
03	GLM-5	Z.ai	Open	95.8
04	Kimi K2.5	Moonshot AI	Open	95.8
05	GLM-5.1	Z.ai	Open	95.3
06	Qwen3.6-27B	Alibaba	Open	94.1
07	Qwen3.5-397B-A17B	Alibaba	Open	93.3
08	Qwen3.6 35B-A3B	Alibaba	Open	92.7
09	Qwen3.5-27B	Alibaba	Open	90.8
10	Nvidia Nemotron 3 Super	NVIDIA	Open	90.0
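As a quick consistency check, the mean, median, and top score reported on this page can be reproduced from the ten scores in the table above. The snippet below is only a sketch using Python's standard library; the variable names are illustrative.

```python
from statistics import mean, median

# Scores copied from the Top Scorers table, in rank order.
scores = [96.4, 95.8, 95.8, 95.8, 95.3, 94.1, 93.3, 92.7, 90.8, 90.0]

print(len(scores))               # 10
print(max(scores))               # 96.4
print(round(mean(scores), 1))    # 94.0
print(round(median(scores), 1))  # 94.7
```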

Open vs Closed Source

Top Open-Source Models

1. Kimi K2.6: 96.4
2. Ring-2.6-1T: 95.8
3. GLM-5: 95.8

Top Closed-Source Models

No models in this category.

Average Score by Lab

Moonshot AI: 96.1 (n = 2)
Z.ai: 95.6 (n = 2)
Alibaba: 92.7 (n = 4)
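The per-lab averages can be recomputed from the Top Scorers table. The sketch below assumes two things the page does not state: that only labs with at least two models are shown, and that averages are rounded half-up to one decimal. Both assumptions happen to reproduce the displayed figures.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_HALF_UP

# (model, lab, score) rows copied from the Top Scorers table.
rows = [
    ("Kimi K2.6", "Moonshot AI", "96.4"),
    ("Ring-2.6-1T", "inclusionAI", "95.8"),
    ("GLM-5", "Z.ai", "95.8"),
    ("Kimi K2.5", "Moonshot AI", "95.8"),
    ("GLM-5.1", "Z.ai", "95.3"),
    ("Qwen3.6-27B", "Alibaba", "94.1"),
    ("Qwen3.5-397B-A17B", "Alibaba", "93.3"),
    ("Qwen3.6 35B-A3B", "Alibaba", "92.7"),
    ("Qwen3.5-27B", "Alibaba", "90.8"),
    ("Nvidia Nemotron 3 Super", "NVIDIA", "90.0"),
]

def lab_averages(rows, min_models=2):
    """Average score per lab, keeping labs with at least min_models entries.

    min_models=2 and half-up rounding are assumptions, not documented behavior.
    """
    by_lab = defaultdict(list)
    for _model, lab, score in rows:
        by_lab[lab].append(Decimal(score))  # Decimal avoids float rounding surprises
    result = {}
    for lab, scores in by_lab.items():
        if len(scores) >= min_models:
            avg = sum(scores) / len(scores)
            result[lab] = (avg.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP), len(scores))
    return result

for lab, (avg, n) in lab_averages(rows).items():
    print(f"{lab}: {avg} (n = {n})")
```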

Most Correlated Benchmarks

Benchmark	Pearson r	n
SciCode	+0.80	10
AA Intelligence Index	+0.74	10
Terminal Bench	+0.73	9
SWE-Verified	+0.64	9
HLE	+0.59	10
HMMT 2026	+0.57	9
Terminal Bench Hard	+0.54	10
GPQA	+0.50	10
AA LCR	+0.31	10
IFBench	-0.21	10
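Pearson r ranges from −1 to +1. A positive value means models that score higher on one benchmark tend to score higher on the other; a negative value means the opposite. n is the number of models with scores on both benchmarks.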
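For readers unfamiliar with the statistic, here is a minimal sketch of how a Pearson r between two benchmarks could be computed over the models they share. The function is a textbook implementation, and the second benchmark's scores are made-up placeholders, not values from this page.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)

# AIME 2026 scores for four models from the table, paired with
# hypothetical scores on another benchmark (illustrative only).
aime = [96.4, 95.8, 94.1, 90.0]
other_hypothetical = [50.0, 48.0, 45.0, 40.0]
print(pearson_r(aime, other_hypothetical) > 0)  # True
```

With only 9 or 10 shared models, a coefficient like this carries wide uncertainty, so small differences between correlations should not be over-read.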

What It Captures Well

New each year, so the score reflects reasoning rather than memorization.
Compact, fast to evaluate, and well-suited to chain-of-thought analysis.
High ceiling: top frontier models typically score 80–95% and mid-tier open models 30–60%, though all ten models listed here score between 90.0 and 96.4.

Where It Falls Short

Only 15 problems, so a single lucky or unlucky problem swings the score by about 6.7 points (1/15 of the total).
Very narrow domain: high-school competition math, not applied math or proof writing.
Self-consistency sampling masks raw single-shot ability.
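The small-sample concern in the first point can be made concrete with a little arithmetic. The sketch below computes the score step per problem and, under the simplifying assumption that problems are independent and equally hard (which real contests are not), a rough binomial standard error for a single run.

```python
from math import sqrt

def points_per_problem(n_problems):
    """Score change, in percentage points, from one extra right or wrong answer."""
    return 100.0 / n_problems

def binomial_standard_error(accuracy_percent, n_problems):
    """Rough standard error in percentage points for one run.

    Assumes independent, identically difficult problems (a simplification).
    """
    p = accuracy_percent / 100.0
    return 100.0 * sqrt(p * (1.0 - p) / n_problems)

print(round(points_per_problem(15), 2))             # 6.67
print(round(binomial_standard_error(94.0, 15), 1))  # 6.1
```

Under these assumptions, the gap between the top and bottom models in the table (96.4 vs 90.0) is about the size of one standard error for a single run, which is one reason leaderboards average over several samples.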

Frequently Asked Questions

Why use AIME instead of older math benchmarks?

Older benchmarks like GSM8K and MATH are saturated and have partially leaked into training data. AIME 2026 was released after the training cutoffs of most current models, so for those models the score is a clean read on reasoning.

What is the difference between AIME and HMMT?

Both are elite high-school competition math sets. HMMT is generally harder, with fewer per-problem guessing tricks. A model that scores 50% on AIME often scores 25–35% on HMMT.

How many tries does each model get?

Leaderboards vary. Some report single-shot accuracy, others report majority vote over 32 or 64 samples. Higher sample counts can boost the same model's score by 10–25 points, so check the methodology before comparing.
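To see why the two reporting styles diverge, the following sketch scores the same set of samples both ways. The data is invented for illustration, and the function names are hypothetical rather than taken from any leaderboard's code.

```python
from collections import Counter

def mean_single_shot(samples_per_problem, references):
    """Average accuracy in percent when every sample is scored on its own."""
    per_problem = [
        sum(s == ref for s in samples) / len(samples)
        for samples, ref in zip(samples_per_problem, references)
    ]
    return 100.0 * sum(per_problem) / len(per_problem)

def majority_vote(samples_per_problem, references):
    """Accuracy in percent when only the most common answer per problem counts."""
    hits = sum(
        Counter(samples).most_common(1)[0][0] == ref
        for samples, ref in zip(samples_per_problem, references)
    )
    return 100.0 * hits / len(references)

# Hypothetical data: 2 problems, 4 samples each.
references = [10, 20]
samples = [
    [10, 10, 10, 3],  # 3 of 4 samples correct
    [20, 20, 5, 6],   # 2 of 4 correct, but 20 is still the most common answer
]
print(mean_single_shot(samples, references))  # 62.5
print(majority_vote(samples, references))     # 100.0
```

In this toy case the same outputs yield 62.5 under single-shot averaging and 100.0 under majority vote, which is why the two numbers should never be compared directly.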

Related Benchmarks

Based on score correlations across our database.

SciCode (Pearson r +0.80), AA Intelligence Index (+0.74), Terminal Bench (+0.73), SWE-Verified (+0.64)

