Agent Benchmark · 2024

CORE-Bench Hard

Name: CORE-Bench Hard: CORE-Bench (Hard)
Creator: Princeton
Published: 2024
Keywords: CORE-Bench Hard, AI agent benchmark, Princeton

Reproduce the results of published research papers from their code and data, on the hardest setting.

Top Score: 77.8
Published: 2024
Source: Princeton

How It Works

CORE-Bench hands an agent the code and data behind a real scientific paper and asks it to reproduce specific results. The hard setting gives the least hand-holding, so the agent has to install dependencies, run the right scripts, and read the output. It measures whether an agent can operate a real research codebase end to end.

Each task provides a paper's repository and a set of questions whose answers come from rerunning the analysis. The agent works in a shell and must produce the correct numeric or categorical answers. Scores are the percentage of tasks where every answer is correct.
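The all-or-nothing scoring described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual grader: the function names, the dictionary layout of answers, and the 1% relative tolerance for numeric answers are all assumptions made for the example.

```python
import math
from typing import Mapping, Sequence, Union

Answer = Union[str, float]


def answers_match(predicted: Answer, expected: Answer, rel_tol: float = 1e-2) -> bool:
    """Compare one answer. The numeric tolerance is an illustrative assumption."""
    if isinstance(expected, (int, float)):
        try:
            return math.isclose(float(predicted), float(expected), rel_tol=rel_tol)
        except (TypeError, ValueError):
            # A non-numeric prediction cannot match a numeric answer.
            return False
    # Categorical answers: compare case-insensitively, ignoring outer whitespace.
    return str(predicted).strip().lower() == str(expected).strip().lower()


def task_passes(predicted: Mapping[str, Answer], expected: Mapping[str, Answer]) -> bool:
    """A task counts only if every expected question is answered correctly."""
    return all(
        question in predicted and answers_match(predicted[question], answer)
        for question, answer in expected.items()
    )


def benchmark_score(task_results: Sequence[bool]) -> float:
    """Percentage of tasks that passed."""
    if not task_results:
        raise ValueError("no task results to score")
    return 100.0 * sum(task_results) / len(task_results)


# Hypothetical task with one numeric and one categorical question.
expected = {"q1": 0.91, "q2": "increase"}
results = [
    task_passes({"q1": 0.91, "q2": "Increase"}, expected),  # all answers correct
    task_passes({"q1": 0.50, "q2": "increase"}, expected),  # one wrong answer fails the task
    task_passes({"q1": 0.91}, expected),                    # a missing answer fails the task
]
print(round(benchmark_score(results), 1))
```

Because a single wrong or missing answer fails the whole task, partial progress on a repository earns no credit under this scheme.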

Dataset size: Computational reproducibility tasks from real papers.
Agent type: Research
Published by: Princeton
Year: 2024

Top Agent Systems

#	Agent System	Model	Score
01	Claude Code (submitted by Nicholas Carlini)	Claude Opus 4.5	77.8
02	Claude Code (submitted by Nicholas Carlini)	Claude Sonnet 4.5 (September 2025)	62.2
03	CORE-Agent	Claude Opus 4.1 (August 2025)	51.1
04	Claude Code (submitted by Nicholas Carlini)	Claude Sonnet 4 (May 2025)	46.7
05	CORE-Agent	Claude Sonnet 4.5 High (September 2025)	44.4
06	CORE-Agent	Claude Opus 4.5 High (November 2025)	42.2
07	CORE-Agent	Claude Opus 4.1 High (August 2025)	42.2
08	Claude Code (submitted by Nicholas Carlini)	Claude Opus 4.1	42.2
09	CORE-Agent	Claude Opus 4.5 (November 2025)	42.2
10	CORE-Agent	Gemini 3 Pro Preview High (November 2025)	40.0
11	CORE-Agent	Claude Sonnet 4.5 (September 2025)	37.8
12	HAL Generalist Agent	Claude-3.7 Sonnet High (February 2025)	37.8
13	HAL Generalist Agent	Claude Opus 4.1 (August 2025)	35.6
14	HAL Generalist Agent	Gemini 3 Pro Preview High (November 2025)	35.6
15	CORE-Agent	Claude-3.7 Sonnet (February 2025)	35.6
16	HAL Generalist Agent	o4-mini High (April 2025)	35.6
17	HAL Generalist Agent	Claude Opus 4.1 High (August 2025)	33.3
18	HAL Generalist Agent	Claude Opus 4.5 (November 2025)	33.3
19	CORE-Agent	GPT-4.1 (April 2025)	33.3
20	CORE-Agent	Claude Sonnet 4 High (May 2025)	33.3
21	HAL Generalist Agent	Claude Sonnet 4.5 (September 2025)	33.3
22	HAL Generalist Agent	Claude Opus 4.5 High (November 2025)	31.1
23	HAL Generalist Agent	Claude-3.7 Sonnet (February 2025)	31.1
24	HAL Generalist Agent	Claude Sonnet 4.5 High (September 2025)	28.9
25	CORE-Agent	Claude Sonnet 4 (May 2025)	28.9

Strengths

Tests long, messy, real-world computational workflows.
Answers are checkable, so scoring is objective.
Reflects a concrete need: reproducible science and data work.

Limitations

Environment setup failures can dominate the score.
Skewed toward the languages and tools common in the source papers.
Long tasks make runs slow and expensive.

Frequently Asked Questions

Why is the hard setting separated out?

CORE-Bench has easier settings that give agents more scaffolding. The hard setting strips that away, so it is the truest test of autonomous research-code execution and the one this leaderboard tracks.

Other Agent Benchmarks

Browse the other benchmarks on the leaderboard.

Research

GAIA

Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.

Research

ScienceAgentBench

Data-driven scientific discovery tasks that ask an agent to run an analysis and produce a correct result.

Coding

SWE-bench Verified

Real GitHub issues an agent has to fix by editing a codebase until the project test suite passes.

Coding

SWE-bench Verified Mini

A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.
