Reproduce the results of published research papers from their code and data, on the hardest setting.
CORE-Bench hands an agent the code and data behind a real scientific paper and asks it to reproduce specific results. The hard setting gives the least hand-holding, so the agent has to install dependencies, run the right scripts, and read the output. It measures whether an agent can operate a real research codebase end to end.
Each task provides a paper's repository and a set of questions whose answers come from rerunning the analysis. The agent works in a shell and must produce the correct numeric or categorical answers. Scores are the percentage of tasks where every answer is correct.
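The all-or-nothing scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual grader: the function names, the dictionary layout, and the 1% relative tolerance for numeric answers are all assumptions made for the example.

```python
import math
from typing import Dict, Union

Answer = Union[str, int, float]


def answer_matches(expected: Answer, predicted: Answer, rel_tol: float = 0.01) -> bool:
    """Compare one answer. The 1% tolerance is an illustrative assumption."""
    if isinstance(expected, (int, float)) and not isinstance(expected, bool):
        # Numeric answer: accept values within the assumed relative tolerance.
        try:
            return math.isclose(float(predicted), float(expected), rel_tol=rel_tol)
        except (TypeError, ValueError):
            return False
    # Categorical answer: compare case-insensitively, ignoring outer whitespace.
    return str(predicted).strip().lower() == str(expected).strip().lower()


def task_solved(expected: Dict[str, Answer], predicted: Dict[str, Answer]) -> bool:
    """A task counts only if every one of its questions is answered correctly."""
    return all(
        question in predicted and answer_matches(answer, predicted[question])
        for question, answer in expected.items()
    )


def benchmark_score(
    tasks: Dict[str, Dict[str, Answer]],
    submissions: Dict[str, Dict[str, Answer]],
) -> float:
    """Percentage of tasks with every answer correct, rounded to one decimal."""
    if not tasks:
        raise ValueError("no tasks to score")
    solved = sum(
        task_solved(expected, submissions.get(task_id, {}))
        for task_id, expected in tasks.items()
    )
    return round(100.0 * solved / len(tasks), 1)


# Hypothetical tasks and answers, invented purely for illustration.
tasks = {
    "paper_a": {"q1": 0.85, "q2": "increasing"},
    "paper_b": {"q1": 12},
    "paper_c": {"q1": "blue"},
}
submissions = {
    "paper_a": {"q1": 0.851, "q2": "Increasing"},  # both answers correct
    "paper_b": {"q1": 15},                         # wrong value
    # paper_c has no submission, so it counts as unsolved
}
print(benchmark_score(tasks, submissions))  # 33.3
```

Under this rule a single wrong or missing answer forfeits the whole task, so partial progress on a paper earns no credit.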
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | Claude Code (submitted by Nicholas Carlini) | Claude Opus 4.5 | 77.8 |
| 02 | Claude Code (submitted by Nicholas Carlini) | Claude Sonnet 4.5 (September 2025) | 62.2 |
| 03 | CORE-Agent | Claude Opus 4.1 (August 2025) | 51.1 |
| 04 | Claude Code (submitted by Nicholas Carlini) | Claude Sonnet 4 (May 2025) | 46.7 |
| 05 | CORE-Agent | Claude Sonnet 4.5 High (September 2025) | 44.4 |
| 06 | CORE-Agent | Claude Opus 4.5 High (November 2025) | 42.2 |
| 07 | CORE-Agent | Claude Opus 4.1 High (August 2025) | 42.2 |
| 08 | Claude Code (submitted by Nicholas Carlini) | Claude Opus 4.1 | 42.2 |
| 09 | CORE-Agent | Claude Opus 4.5 (November 2025) | 42.2 |
| 10 | CORE-Agent | Gemini 3 Pro Preview High (November 2025) | 40.0 |
| 11 | CORE-Agent | Claude Sonnet 4.5 (September 2025) | 37.8 |
| 12 | HAL Generalist Agent | Claude-3.7 Sonnet High (February 2025) | 37.8 |
| 13 | HAL Generalist Agent | Claude Opus 4.1 (August 2025) | 35.6 |
| 14 | HAL Generalist Agent | Gemini 3 Pro Preview High (November 2025) | 35.6 |
| 15 | CORE-Agent | Claude-3.7 Sonnet (February 2025) | 35.6 |
| 16 | HAL Generalist Agent | o4-mini High (April 2025) | 35.6 |
| 17 | HAL Generalist Agent | Claude Opus 4.1 High (August 2025) | 33.3 |
| 18 | HAL Generalist Agent | Claude Opus 4.5 (November 2025) | 33.3 |
| 19 | CORE-Agent | GPT-4.1 (April 2025) | 33.3 |
| 20 | CORE-Agent | Claude Sonnet 4 High (May 2025) | 33.3 |
| 21 | HAL Generalist Agent | Claude Sonnet 4.5 (September 2025) | 33.3 |
| 22 | HAL Generalist Agent | Claude-3.7 Sonnet (February 2025) | 31.1 |
| 23 | HAL Generalist Agent | Claude Opus 4.5 High (November 2025) | 31.1 |
| 24 | HAL Generalist Agent | Claude Sonnet 4.5 High (September 2025) | 28.9 |
| 25 | CORE-Agent | Claude Sonnet 4 (May 2025) | 28.9 |
CORE-Bench also has easier settings that give agents more scaffolding. The hard setting strips that away, which makes it the most demanding test of autonomous research-code execution, and it is the one this leaderboard tracks.