8,500 grade-school word problems that test whether a model can do multi-step arithmetic reasoning, not just recall.
GSM8K is one of the longest-running math reasoning benchmarks in modern LLM evaluation. The problems use only basic arithmetic, but each one takes two to eight steps of chained reasoning. That makes it a clean test of whether a model can follow a logical chain without dropping a number or making a sign error.
A model reads the problem and produces a final numerical answer. Official scoring is exact match on that number, so showing the work earns no partial credit. Most leaderboards report a zero-shot chain-of-thought score; some also report self-consistency, where several solutions are sampled and the majority answer is taken.
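Scoring is simple enough to sketch. Below is a minimal illustration in Python, assuming the standard GSM8K reference format where each gold answer ends in `#### <number>`; the number-extraction regex and function names are assumptions for illustration, not the official harness.

```python
import re
from collections import Counter

def extract_final_number(text: str) -> str | None:
    """Return the last number in the text, thousands separators stripped."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None

def gsm8k_exact_match(prediction: str, reference: str) -> bool:
    """Exact match on the final number; the work shown earns no credit."""
    gold = reference.split("####")[-1].strip().replace(",", "")
    return extract_final_number(prediction) == gold

def self_consistency(samples: list[str]) -> str | None:
    """Majority vote over the final answers of several sampled solutions."""
    answers = [a for a in map(extract_final_number, samples) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```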
Janet has 3 children. Each child has 5 marbles. She buys 3 more bags of marbles, each with 8 marbles, and splits the new marbles equally among her children. How many marbles does each child have now?
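For illustration, the solution chain here runs three steps: 3 bags × 8 = 24 new marbles; 24 ÷ 3 = 8 more per child; 5 + 8 = 13 marbles each. (This sample is written in GSM8K's style; it is not drawn from the official test set.)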
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | DeepSeek-V4-Pro | DeepSeek | Open | 92.6 |
| 02 | DeepSeek-V3 | DeepSeek | Open | 89.3 |
| 03 | Llama 3.1 8B Instruct | Meta | Open | 84.5 |
For frontier models, no: they all score above 95%, and the test can no longer separate them. For small and mid-tier open-source models, yes. A model that scores below 80% on GSM8K is unlikely to handle multi-step business logic reliably.
The arithmetic is easy, but the chains are long. Without explicit step-by-step output, models drop intermediate values. Asking for reasoning before the final answer (sketched below) can roughly double scores on smaller models.
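As a rough sketch of the difference, the two prompting styles might look like this; `ask_model` is a hypothetical stand-in for whatever inference client you use, and `extract_final_number` is the helper sketched above.

```python
# Hypothetical comparison of direct vs. chain-of-thought prompting.
# `ask_model` is a placeholder, not a real API.

problem = (
    "Janet has 3 children. Each child has 5 marbles. She buys 3 more bags "
    "of marbles, each with 8 marbles, and splits the new marbles equally "
    "among her children. How many marbles does each child have now?"
)

# Direct prompting: the model must carry every intermediate value internally.
direct_prompt = f"{problem}\nAnswer with a single number."

# Chain-of-thought prompting: intermediate values are written out before the
# answer, which is where smaller models recover most of their accuracy.
cot_prompt = (
    f"{problem}\n"
    "Think step by step, writing out each intermediate value. "
    "Then give the final answer on its own line as: #### <number>"
)

# answer = extract_final_number(ask_model(cot_prompt))
```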
GSM8K is grade-school arithmetic with long chains. AIME and HMMT are competition math with algebra, geometry, and number theory. A strong GSM8K score does not predict competition-math performance.