Data-driven scientific discovery tasks that ask an agent to run an analysis and produce a correct result.
ScienceAgentBench gives an agent a realistic data-analysis task pulled from a published paper, such as cleaning a dataset, fitting a model, or making a figure. It measures whether the agent can write and run the code that produces a scientifically valid output, judged against what the original researchers did.
Agents work in a coding environment and submit a program plus its output. Results are scored on whether the output is valid and matches the expected result, using a rubric that checks the produced artifacts. The headline number is a success rate across the 102 tasks.
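As a rough illustration of how the headline number could be computed, the sketch below takes one pass/fail outcome per task and reports the percentage that succeeded. The `TaskResult` type, the `success_rate` function, and the task identifiers are hypothetical names chosen for this example, not part of the benchmark's own tooling, and the rubric that decides each pass/fail outcome is not modeled here.

```python
from dataclasses import dataclass
from typing import Sequence

TOTAL_TASKS = 102  # number of tasks stated in the text above


@dataclass
class TaskResult:
    task_id: str      # hypothetical identifier, for illustration only
    succeeded: bool   # outcome of the rubric check (assumed to be pass/fail)


def success_rate(results: Sequence[TaskResult], total_tasks: int = TOTAL_TASKS) -> float:
    """Return the percentage of tasks that succeeded, rounded to one decimal."""
    if len(results) != total_tasks:
        raise ValueError(f"expected {total_tasks} results, got {len(results)}")
    solved = sum(r.succeeded for r in results)
    return round(100 * solved / total_tasks, 1)


if __name__ == "__main__":
    # Made-up outcomes: the first 34 tasks pass, the rest fail.
    demo = [TaskResult(f"task_{i}", i < 34) for i in range(TOTAL_TASKS)]
    print(success_rate(demo))  # 33.3
```

Under this assumed definition, 34 successes out of 102 tasks rounds to 33.3 and 22 out of 102 rounds to 21.6, which is consistent with the scores in the table below, though the actual per-task counts are not given here.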
| # | Agent System | Model | Success rate (%) |
|---|---|---|---|
| 01 | SAB Self-Debug | — | 33.3 |
| 02 | HAL Generalist Agent | — | 21.6 |
Success rates are modest: the systems listed here reach 33.3% and 21.6%, well under half, because real scientific analysis has many ways to go subtly wrong even when the code runs.