8,500 grade-school word problems that test whether a model can do multi-step arithmetic reasoning, not just recall.
GSM8K is one of the longest-running math reasoning benchmarks in modern LLM evaluation. The problems use only basic arithmetic, but each one takes two to eight steps of chained reasoning. That makes it a clean test of whether a model can follow a logical chain without dropping a number or making a sign error.
A model reads the problem and produces a final numerical answer. Official scoring is exact match on that number, so showing the work earns no partial credit. Most leaderboards report a zero-shot chain-of-thought score; some also report self-consistency, where several solutions are sampled and the majority answer is taken.
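Scoring is simple enough to sketch. Below is a minimal illustration in Python, assuming the standard GSM8K reference format where each gold answer ends in `#### <number>`; the number-extraction regex and function names are assumptions for illustration, not the official harness.

```python
import re
from collections import Counter

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, thousands separators stripped."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_exact_match(prediction: str, reference: str) -> bool:
    """Exact match on the final number; the work shown earns no credit."""
    gold = reference.split("####")[-1].strip().replace(",", "")
    return extract_final_number(prediction) == gold

def self_consistency(samples: list[str]) -> str | None:
    """Majority vote over the final answers of several sampled solutions."""
    answers = [a for a in map(extract_final_number, samples) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```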
Janet has 3 children. Each child has 5 marbles. She buys 3 more bags of marbles, each with 8 marbles, and splits the new marbles equally among her children. How many marbles does each child have now?
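For illustration, the solution chain here runs three steps: 3 bags × 8 = 24 new marbles; 24 ÷ 3 = 8 more per child; 5 + 8 = 13 marbles each. (This sample is written in GSM8K's style; it is not drawn from the official test set.)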
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | DeepSeek-V4-Pro | DeepSeek | Open | 92.6 |
| 02 | DeepSeek-V3 | DeepSeek | Open | 89.3 |
| 03 | Llama 3.1 8B Instruct | Meta | Open | 84.5 |
For frontier models, no: they all score above 95%, and the test can no longer separate them. For small and mid-tier open-source models, yes. A model that scores below 80% on GSM8K is unlikely to handle multi-step business logic reliably.
The arithmetic is easy, but the chains are long. Without explicit step-by-step output, models drop intermediate values. Asking for reasoning before the final answer (sketched below) can roughly double scores on smaller models.
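As a rough sketch of the difference, the two prompting styles might look like this; `ask_model` is a hypothetical stand-in for whatever inference client you use, and `extract_final_number` is the helper sketched above.

```python
# Hypothetical comparison of direct vs. chain-of-thought prompting.
# `ask_model` is a placeholder, not a real API.

problem = (
    "Janet has 3 children. Each child has 5 marbles. She buys 3 more bags "
    "of marbles, each with 8 marbles, and splits the new marbles equally "
    "among her children. How many marbles does each child have now?"
)

# Direct prompting: the model must carry every intermediate value internally.
direct_prompt = f"{problem}\nAnswer with a single number."

# Chain-of-thought prompting: intermediate values are written out before the
# answer, which is where smaller models recover most of their accuracy.
cot_prompt = (
    f"{problem}\n"
    "Think step by step, writing out each intermediate value. "
    "Then give the final answer on its own line as: #### <number>"
)

# answer = extract_final_number(ask_model(cot_prompt))
```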
GSM8K is grade-school arithmetic with long chains. AIME and HMMT are competition math with algebra, geometry, and number theory. A strong GSM8K score does not predict competition-math performance.