Agent Benchmark · 2024

AssistantBench

Name: AssistantBench
Creator: Allen Institute for AI and others
Published: 2024
Keywords: AssistantBench, AI agent benchmark, Allen Institute for AI and others

Time-consuming, realistic web tasks that require browsing many live pages to find one answer.


Systems Ranked

Top Score: 38.8
Published: 2024
Source: Allen Institute for AI and others

How It Works

AssistantBench asks the kind of research question that would take a person many minutes of clicking around the open web, such as comparing prices across sites or pulling a figure out of a report. It measures whether a web agent can navigate live pages and return an accurate answer, not just summarize a single page.

Agents browse the live web to answer each question. Answers are scored for accuracy against a gold answer, with partial credit for close numeric or list answers. The headline number is an accuracy score across the task set.
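The scoring idea described above can be illustrated with a small sketch. This is not the official AssistantBench scorer. The function names, the log-scale decay for numeric answers, and the F1 overlap for list answers are all assumptions chosen to show how partial credit might roll up into a single accuracy figure.

```python
import math


def numeric_score(pred: float, gold: float) -> float:
    """Hypothetical partial credit for numeric answers.

    Assumption: credit decays with the order-of-magnitude gap between
    the prediction and the gold answer, reaching zero at a 10x miss.
    """
    if pred == gold:
        return 1.0
    if pred <= 0 or gold <= 0:
        # The ratio below is only meaningful for positive values.
        return 0.0
    ratio = max(pred, gold) / min(pred, gold)
    return max(0.0, 1.0 - math.log10(ratio))


def list_score(pred: list[str], gold: list[str]) -> float:
    """Hypothetical partial credit for list answers: F1 over normalized items."""
    pred_items = {item.strip().lower() for item in pred}
    gold_items = {item.strip().lower() for item in gold}
    if not pred_items or not gold_items:
        return 1.0 if pred_items == gold_items else 0.0
    hits = len(pred_items & gold_items)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_items)
    recall = hits / len(gold_items)
    return 2 * precision * recall / (precision + recall)


def benchmark_accuracy(task_scores: list[float]) -> float:
    """Average per-task scores (each in [0, 1]) into a 0-100 headline number."""
    if not task_scores:
        return 0.0
    return 100.0 * sum(task_scores) / len(task_scores)


if __name__ == "__main__":
    scores = [
        numeric_score(100.0, 100.0),                                 # exact match
        list_score(["Site A", "Site B"], ["site a", "site c"]),      # one of two items right
        numeric_score(10.0, 100.0),                                  # off by a factor of ten
    ]
    print(benchmark_accuracy(scores))
```

Under these assumptions, a near-miss numeric answer or a partly correct list still earns some credit, while an answer that is far off earns none, and the headline score is the mean over all tasks.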

Dataset size: 214 realistic, open-web information tasks
Agent type: Browser
Published by: Allen Institute for AI and others
Year: 2024

Top Agent Systems

#	Agent System	Model	Score
01	Browser-Use	—	38.8

Strengths

Tasks reflect genuinely useful, time-saving web research.
Open-web setting tests robustness to messy real pages.
Partial-credit scoring handles near-miss numeric answers fairly.

Limitations

Live pages change, so scores drift over time.
Grading open-ended answers for accuracy is harder to do reliably than exact-match scoring.
Performance depends heavily on the browsing tools provided.

Frequently Asked Questions

How hard is AssistantBench for current agents?

Very. Even strong web agents score well under 50% accuracy, and many land in the teens, because real multi-site research is unforgiving.

Other Agent Benchmarks

Browse the other benchmarks on the leaderboard.

Browser

Online-Mind2Web

Live website tasks that test whether a browser agent can complete real actions on the open web.

Research

GAIA

Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.

Coding

SWE-bench Verified

Real GitHub issues an agent has to fix by editing a codebase until the project test suite passes.

Coding

SWE-bench Verified Mini

A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

