Expert-written science questions that PhD researchers can barely solve and Google searches cannot answer.
GPQA tests whether a model can reason through hard graduate-level science problems on its own. The questions are written by domain PhDs and explicitly checked to be "Google-proof," which means a smart non-expert with 30 minutes and a search engine still cannot solve them. That setup measures real subject-matter reasoning, not retrieval or pattern matching against the open web.
Each question is multiple choice with one correct answer and a fixed set of distractors. Models are evaluated zero-shot or with a short reasoning prompt. Scores are reported as percent of questions answered correctly. The harder "Diamond" subset of 198 questions is the slice most labs publish numbers on.
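The scoring rule described above is plain accuracy. A minimal sketch, assuming answers are stored as option letters; the `accuracy` helper and the five-question sample data are hypothetical and not taken from any official evaluation harness:

```python
def accuracy(predictions, gold):
    """Return the percentage of questions answered correctly."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    # Compare option letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        p.strip().upper() == g.strip().upper()
        for p, g in zip(predictions, gold)
    )
    return 100.0 * correct / len(gold)


# Hypothetical model outputs and answer key for five questions.
predictions = ["A", "c", "B", "D", "A"]
gold = ["A", "C", "D", "D", "B"]
score = accuracy(predictions, gold)
print(f"{score:.1f}%")  # prints "60.0%"
```

With four options per question, as in the published GPQA dataset, random guessing lands near 25%, which is a useful reference point when reading low scores.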
Sample question: A particle is in a 1D infinite potential well of width L. If the particle is in the ground state, what is the probability of finding it in the region between L/4 and 3L/4?
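As a check on the sample question above: the ground-state wavefunction is ψ(x) = sqrt(2/L)·sin(πx/L), so the probability is the integral of |ψ|² from L/4 to 3L/4, which works out analytically to 1/2 + 1/π ≈ 0.818. The sketch below confirms that numerically with a midpoint rule. The function names and step count are illustrative choices, and the width is set to 1 because the result does not depend on L.

```python
import math


def ground_state_density(x, width=1.0):
    """|psi_1(x)|^2 for the infinite square well on [0, width]."""
    return (2.0 / width) * math.sin(math.pi * x / width) ** 2


def probability_between(a, b, width=1.0, steps=100_000):
    """Midpoint-rule integral of the ground-state density over [a, b]."""
    h = (b - a) / steps
    total = sum(
        ground_state_density(a + (i + 0.5) * h, width) for i in range(steps)
    )
    return h * total


width = 1.0
numeric = probability_between(width / 4, 3 * width / 4, width)
analytic = 0.5 + 1.0 / math.pi  # closed-form result of the same integral
print(f"numeric={numeric:.6f} analytic={analytic:.6f}")
```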
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Gemini 3.1 Pro Preview | | Closed | 94.1 |
| 02 | GPT-5.5 | OpenAI | Closed | 93.5 |
| 03 | MiniMax M3 | MiniMax | Open | 92.9 |
| 04 | Qwen 3.7 Max | Alibaba | Closed | 92.3 |
| 05 | Gemini 3.5 Flash | | Closed | 92.2 |
| 06 | GPT-5.4 | OpenAI | Closed | 92.0 |
| 07 | GPT-5.4 High | OpenAI | Closed | 92.0 |
| 08 | Claude Opus 4.7 Thinking | Anthropic | Closed | 91.4 |
| 09 | Claude Opus 4.7 | Anthropic | Closed | 91.4 |
| 10 | Kimi K2.6 | Moonshot AI | Open | 91.1 |
| 11 | Gemini 3 Pro | | Closed | 90.8 |
| 12 | GPT-5.2 | OpenAI | Closed | 90.3 |
| 13 | GPT-5.2 High | OpenAI | Closed | 90.3 |
| 14 | Grok 4.3 | xAI | Closed | 90.1 |
| 15 | Grok 4.3 beta | xAI | Closed | 90.1 |
76 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.
A human PhD in the relevant field scores about 65% on the Diamond subset, while a non-expert with Google scores around 34%. Frontier models in 2026 score from roughly 80% into the low 90s; mid-tier open-source models land between 45% and 65%.
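Because the Diamond subset has only 198 questions, any reported percentage carries some sampling noise. A rough sketch of how large that noise might be, assuming each question is an independent trial (a simplification) and using the normal approximation to the binomial; `standard_error_pct` is a hypothetical helper, not part of any benchmark tooling:

```python
import math

DIAMOND_QUESTIONS = 198  # size of the Diamond subset


def standard_error_pct(score_pct, n=DIAMOND_QUESTIONS):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Illustrative scores taken from the ranges quoted in the text.
for score in (45.0, 65.0, 90.0):
    se = standard_error_pct(score)
    print(f"{score:.1f}% +/- {1.96 * se:.1f} points (approx. 95% interval)")
```

Under these assumptions the standard error near 90% is about two percentage points, so gaps of a point or less between neighboring entries may not be meaningful on their own.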
"Google-proof" means that every question was checked by non-expert validators who had open web access and 30 minutes per question. Only questions that the validators could not solve made it into the final dataset, so a model has to reason rather than retrieve.
GPQA and MMLU are not interchangeable. MMLU is broad and shallow: high school through college knowledge across 57 subjects. GPQA is narrow and deep: graduate-level reasoning in three hard sciences (biology, physics, and chemistry). Strong MMLU scores do not guarantee strong GPQA scores.
Memorized subject knowledge helps only partially. It sets a floor, but the questions require multi-step reasoning that textbook memorization alone cannot supply. Based on score correlations across our database, the best scores track chain-of-thought ability, not training corpus size.