About 8,500 grade-school math word problems that test whether a model can do multi-step arithmetic reasoning, not just recall.
GSM8K is the longest-running math reasoning benchmark in modern LLM evaluation. The problems use only basic arithmetic, but each one needs two to eight steps of chained reasoning. That makes it a clean test for whether a model can follow a logical chain without dropping a number or making a sign error.
A model reads the problem and produces a final numerical answer. The official scoring is exact-match on the number, so showing the work does not earn partial credit. Most leaderboards report the chain-of-thought zero-shot score; some also report self-consistency with majority voting across multiple samples.
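The two scoring modes described above can be sketched in a few lines. This is a minimal illustration, not the official scorer: it assumes the final answer is the last number that appears in the model's output, and the function names (`extract_final_number`, `exact_match`, `majority_vote`) are hypothetical.

```python
import re
from collections import Counter
from decimal import Decimal

# Assumption: the final answer is the last number in the completion.
# Matches optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in the text as a Decimal, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return Decimal(matches[-1].replace(",", ""))


def exact_match(completion, reference):
    """True only if the extracted number equals the reference; no partial credit."""
    predicted = extract_final_number(completion)
    return predicted is not None and predicted == Decimal(reference)


def majority_vote(completions):
    """Self-consistency: the most common extracted answer across samples."""
    answers = [a for a in map(extract_final_number, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]
```

Comparing as `Decimal` means `18` and `18.0` count as the same answer, while a correct chain of reasoning that ends on the wrong number still scores zero. Real evaluation harnesses often use stricter answer markers than "last number in the text", so treat the extraction rule here as one possible choice.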
Janet has 3 children. Each child has 5 marbles. She buys 2 more bags of marbles, each with 8 marbles, and splits the new marbles equally among her children. How many marbles does each child have now?
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | DeepSeek-V4-Pro | DeepSeek | Open | 92.6 |
| 02 | DeepSeek-V3 | DeepSeek | Open | 89.3 |
GSM8K no longer separates frontier models: they all score above 95%, so the test cannot tell them apart. It is still informative for small and mid-tier open-source models. A model that scores below 80% on GSM8K is unlikely to handle multi-step business logic reliably.
The arithmetic is easy but the chains are long. Without explicit step-by-step output, models drop intermediate values. Asking for reasoning before the final answer can double scores on smaller models.
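To make the difference concrete, here is a sketch of the two prompting styles. The wording of both templates is an assumption for illustration and is not taken from any particular evaluation harness; the function names and the sample question are hypothetical.

```python
def direct_prompt(question: str) -> str:
    """Ask for the number only; the model has no room to track intermediate values."""
    return f"Question: {question}\nAnswer with a single number only.\nAnswer:"


def step_by_step_prompt(question: str) -> str:
    """Ask for the reasoning first, then the final number on its own line."""
    return (
        f"Question: {question}\n"
        "Work through the problem one step at a time, "
        "writing down each intermediate value.\n"
        "Then give the result on its own line as 'Final answer: <number>'.\n"
        "Reasoning:"
    )


if __name__ == "__main__":
    # Hypothetical question, used only to show the two prompt shapes.
    question = "A crate holds 4 rows of 6 eggs. 5 eggs break. How many eggs are left?"
    print(direct_prompt(question))
    print()
    print(step_by_step_prompt(question))
```

The second template forces intermediate values into the output before the answer, which is where the chain would otherwise lose a number. Fixing the final-answer line format also makes the number easier to pull out for exact-match scoring.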
GSM8K is grade-school arithmetic with long chains, while AIME and HMMT are competition math covering algebra, geometry, and number theory. A strong GSM8K score does not predict competition math performance.
Based on score correlations across our database.