Expert-written science questions that PhD researchers can barely solve and Google searches cannot answer.
GPQA tests whether a model can reason through hard graduate-level science problems on its own. The questions are written by domain PhDs and explicitly checked to be "Google-proof," which means a smart non-expert with 30 minutes and a search engine still cannot solve them. That setup measures real subject-matter reasoning, not retrieval or pattern matching against the open web.
Each question is multiple choice with one correct answer and three expert-written distractors. Models are evaluated zero-shot or with a short reasoning prompt, and scores are reported as the percent of questions answered correctly. The harder "Diamond" subset of 198 questions is the slice most labs publish numbers on.
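To make the protocol concrete, here is a minimal scoring sketch. The question schema and the `predict` callable are illustrative placeholders, not the official evaluation harness:

```python
import random

def gpqa_accuracy(questions, predict):
    """Score a zero-shot multiple-choice run as percent correct.

    Assumed (hypothetical) schema: each question is a dict with
    "prompt", "choices" (four option strings), and "answer"
    (the index of the correct choice).
    """
    correct = sum(
        predict(q["prompt"], q["choices"]) == q["answer"]
        for q in questions
    )
    return 100.0 * correct / len(questions)

# Sanity check: random guessing over four options should score near 25%.
toy = [{"prompt": f"q{i}", "choices": list("ABCD"), "answer": 0}
       for i in range(10_000)]
print(gpqa_accuracy(toy, lambda prompt, choices: random.randrange(len(choices))))
```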
A representative sample question: A particle is in a 1D infinite potential well of width L. If the particle is in the ground state, what is the probability of finding it in the region between L/4 and 3L/4?
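This one happens to have a clean closed-form answer. A worked solution using the standard infinite-well ground state (illustrative, not quoted from the dataset's answer key):

```latex
% Ground state: \psi_1(x) = \sqrt{2/L}\,\sin(\pi x / L)
P = \int_{L/4}^{3L/4} \frac{2}{L}\,\sin^2\!\left(\frac{\pi x}{L}\right) dx
  = \left[\frac{x}{L} - \frac{1}{2\pi}\,\sin\!\left(\frac{2\pi x}{L}\right)\right]_{L/4}^{3L/4}
  = \frac{1}{2} + \frac{1}{\pi} \approx 0.818
```

The ground-state probability density peaks at the center of the well, so the answer comfortably exceeds the naive 1/2.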
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | GPT-5.2 | OpenAI | Closed | 93.2 |
| 02 | GPT-5.4 | OpenAI | Closed | 92.8 |
| 03 | Gemini 3 Pro | Google | Closed | 91.9 |
| 04 | Claude Opus 4.6 | Anthropic | Closed | 91.3 |
| 05 | Kimi K2.6 | Moonshot AI | Open | 90.5 |
| 06 | Gemini 3 Flash | Google | Closed | 90.4 |
| 07 | DeepSeek-V4-Pro | DeepSeek | Open | 90.1 |
| 08 | Claude Sonnet 4.6 | Anthropic | Closed | 89.9 |
| 09 | Qwen3.5-397B-A17B | Alibaba | Open | 88.4 |
| 10 | DeepSeek-V4-Flash | DeepSeek | Open | 88.1 |
| 11 | GPT-5.1 | OpenAI | Closed | 88.1 |
| 12 | Qwen3.6-27B | Alibaba | Open | 87.8 |
| 13 | Kimi K2.5 | Moonshot AI | Open | 87.6 |
| 14 | Qwen3.5-122B-A10B | Alibaba | Open | 86.6 |
| 15 | GLM-5.1 | Z.ai | Open | 86.2 |
Nine models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.
A human PhD in the relevant field scores about 65% on the Diamond subset, while a non-expert with Google access scores around 34%. Frontier models in 2026 sit in the high 80s to low 90s; mid-tier open-source models land between 45% and 65%.
Every question was checked by non-expert validators who had open web access and 30 minutes per question. Only questions that the validators could not solve made it into the final dataset, so a model has to reason rather than retrieve.
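In code, that admission rule reduces to a single predicate. The data model below is invented for illustration; the published GPQA pipeline records more detail than this:

```python
from dataclasses import dataclass

@dataclass
class ValidatorAttempt:
    solved: bool          # did this non-expert reach the correct answer?
    minutes_spent: float  # open web access, capped at 30 minutes

def is_google_proof(attempts: list[ValidatorAttempt]) -> bool:
    """Admit a question only if every validator failed to solve it."""
    return bool(attempts) and all(not a.solved for a in attempts)
```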
No, GPQA is not just a harder MMLU. MMLU is broad and shallow: high school through college knowledge across 57 subjects. GPQA is narrow and deep: graduate-level reasoning in three hard sciences (biology, physics, and chemistry). Strong MMLU scores do not guarantee strong GPQA scores.
Subject knowledge helps only partially. It sets a floor, but the questions require multi-step reasoning that memorized textbook facts alone cannot supply. The best scores correlate with chain-of-thought ability, not training corpus size.