Expert-written science questions that PhD researchers can barely solve and Google searches cannot answer.
GPQA tests whether a model can reason through hard graduate-level science problems on its own. The questions are written by domain PhDs and explicitly checked to be "Google-proof," which means a smart non-expert with 30 minutes and a search engine still cannot solve them. That setup measures real subject-matter reasoning, not retrieval or pattern matching against the open web.
Each question is multiple choice with one correct answer and three expert-written distractors. Models are evaluated zero-shot or with a short reasoning prompt, and scores are reported as the percent of questions answered correctly. The harder "Diamond" subset of 198 questions is the slice most labs publish numbers on.
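To make the protocol concrete, here is a minimal scoring sketch. The question schema and the `predict` callable are illustrative placeholders, not the official evaluation harness:

```python
import random

def gpqa_accuracy(questions, predict):
    """Score a zero-shot multiple-choice run as percent correct.

    Assumed (hypothetical) schema: each question is a dict with
    "prompt", "choices" (four option strings), and "answer"
    (the index of the correct choice).
    """
    correct = sum(
        predict(q["prompt"], q["choices"]) == q["answer"]
        for q in questions
    )
    return 100.0 * correct / len(questions)

# Sanity check: random guessing over four options should score near 25%.
toy = [{"prompt": f"q{i}", "choices": list("ABCD"), "answer": 0}
       for i in range(10_000)]
print(gpqa_accuracy(toy, lambda prompt, choices: random.randrange(len(choices))))
```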
A representative sample question: A particle is in a 1D infinite potential well of width L. If the particle is in the ground state, what is the probability of finding it in the region between L/4 and 3L/4?
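This one happens to have a clean closed-form answer. A worked solution using the standard infinite-well ground state (illustrative, not quoted from the dataset's answer key):

```latex
% Ground state: \psi_1(x) = \sqrt{2/L}\,\sin(\pi x / L)
P = \int_{L/4}^{3L/4} \frac{2}{L}\,\sin^2\!\left(\frac{\pi x}{L}\right) dx
  = \left[\frac{x}{L} - \frac{1}{2\pi}\,\sin\!\left(\frac{2\pi x}{L}\right)\right]_{L/4}^{3L/4}
  = \frac{1}{2} + \frac{1}{\pi} \approx 0.818
```

The ground-state probability density peaks at the center of the well, so the answer comfortably exceeds the naive 1/2.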
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | GPT-5.2 | OpenAI | Closed | 93.2 |
| 02 | GPT-5.4 | OpenAI | Closed | 92.8 |
| 03 | Gemini 3 Pro | Google | Closed | 91.9 |
| 04 | Claude Opus 4.6 | Anthropic | Closed | 91.3 |
| 05 | Kimi K2.6 | Moonshot AI | Open | 90.5 |
| 06 | Gemini 3 Flash | Google | Closed | 90.4 |
| 07 | DeepSeek-V4-Pro | DeepSeek | Open | 90.1 |
| 08 | Claude Sonnet 4.6 | Anthropic | Closed | 89.9 |
| 09 | Qwen3.5-397B-A17B | Alibaba | Open | 88.4 |
| 10 | DeepSeek-V4-Flash | DeepSeek | Open | 88.1 |
| 11 | GPT-5.1 | OpenAI | Closed | 88.1 |
| 12 | Qwen3.6-27B | Alibaba | Open | 87.8 |
| 13 | Kimi K2.5 | Moonshot AI | Open | 87.6 |
| 14 | Qwen3.5-122B-A10B | Alibaba | Open | 86.6 |
| 15 | GLM-5.1 | Z.ai | Open | 86.2 |
Nine models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.
A human PhD in the relevant field scores about 65% on the Diamond subset, while a non-expert with Google access scores around 34%. Frontier models in 2026 sit in the high 80s to low 90s; mid-tier open-source models land between 45% and 65%.
Every question was checked by non-expert validators who had open web access and 30 minutes per question. Only questions that the validators could not solve made it into the final dataset, so a model has to reason rather than retrieve.
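In code, that admission rule reduces to a single predicate. The data model below is invented for illustration; the published GPQA pipeline records more detail than this:

```python
from dataclasses import dataclass

@dataclass
class ValidatorAttempt:
    solved: bool          # did this non-expert reach the correct answer?
    minutes_spent: float  # open web access, capped at 30 minutes

def is_google_proof(attempts: list[ValidatorAttempt]) -> bool:
    """Admit a question only if every validator failed to solve it."""
    return bool(attempts) and all(not a.solved for a in attempts)
```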
No, GPQA is not just a harder MMLU. MMLU is broad and shallow: high school through college knowledge across 57 subjects. GPQA is narrow and deep: graduate-level reasoning in three hard sciences (biology, physics, and chemistry). Strong MMLU scores do not guarantee strong GPQA scores.
Subject knowledge helps only partially. It sets a floor, but the questions require multi-step reasoning that memorized textbook facts alone cannot supply. The best scores correlate with chain-of-thought ability, not training corpus size.