Agent Benchmark · 2024

ScienceAgentBench

Name: ScienceAgentBench
Creator: Ohio State University
Published: 2024
Keywords: ScienceAgentBench, AI agent benchmark, Ohio State University

Data-driven scientific discovery tasks that ask an agent to run an analysis and produce a correct result.


Systems Ranked

Top Score

33.3

Published

2024

Source

Ohio State University

How It Works

ScienceAgentBench gives an agent a realistic data-analysis task pulled from a published paper, such as cleaning a dataset, fitting a model, or making a figure. It measures whether the agent can write and run the code that produces a scientifically valid output, judged against what the original researchers did.

Agents work in a coding environment and submit a program plus its output. Results are scored on whether the output is valid and matches the expected result, using a rubric that checks the produced artifacts. The headline number is a success rate across the 102 tasks.
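As a rough illustration of how a headline success rate like this could be aggregated, here is a minimal sketch. The `TaskResult` record and its `ran` and `matches` fields are hypothetical names chosen for this example, not the benchmark's actual harness or data format, and the real rubric may check more than these two conditions.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    ran: bool      # the submitted program executed and produced output
    matches: bool  # the output was judged to match the expected result


def success_rate(results: Iterable[TaskResult]) -> float:
    """Return the percentage of tasks that both ran and matched."""
    results = list(results)
    if not results:
        raise ValueError("no task results to score")
    solved = sum(1 for r in results if r.ran and r.matches)
    return 100.0 * solved / len(results)


# Illustrative data only: 102 tasks, of which 34 are marked as solved.
demo = [
    TaskResult(task_id=f"task-{i:03d}", ran=True, matches=(i < 34))
    for i in range(102)
]
print(round(success_rate(demo), 1))
```

Under this assumed scoring, a task counts only when the program runs and its output matches, so code that executes but produces the wrong result contributes nothing to the rate.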

Dataset size

102 tasks drawn from real published research.

Agent type

Research

Published by

Ohio State University

Year

2024

Top Agent Systems

#	Agent System	Model	Score (success rate, %)
01	SAB Self-Debug	—	33.3
02	HAL Generalist Agent	—	21.6

Strengths

Grounded in real published analyses, not synthetic problems.
Tests the full data-science loop from code to result.
Rubric-based grading captures partial progress.

Limitations

Rubric grading is more subjective than pass/fail tests.
Narrow to data-driven scientific workflows.
Small task count limits statistical precision.
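To make the statistical-precision point concrete, the sketch below computes a 95% Wilson score interval for a success rate measured on 102 tasks. The count of 34 solved tasks is an assumption, back-calculated from the 33.3 top score, and the Wilson interval is one common choice for a binomial proportion, not a method this page says the benchmark uses.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 33.3% of 102 tasks corresponds to 34 solved tasks.
low, high = wilson_interval(34, 102)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

With these assumed inputs the interval spans roughly 25% to 43%, so a gap of a few points between two systems on 102 tasks may not be distinguishable from noise.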

Frequently Asked Questions

How do agents perform on ScienceAgentBench?

Success rates are modest. The two systems ranked here score 33.3 and 21.6, both well under half, because real scientific analysis has many ways to go subtly wrong even when the code runs.

Other Agent Benchmarks

Browse the other benchmarks on the leaderboard.

Research

GAIA

Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.

Research

CORE-Bench Hard

Reproduce the results of published research papers from their code and data, on the hardest setting.

Coding

SWE-bench Verified

Real GitHub issues an agent has to fix by editing a codebase until the project test suite passes.

Coding

SWE-bench Verified Mini

A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

