Time-consuming, realistic web tasks that require browsing many live pages to find one answer.
AssistantBench asks the kind of research question that would take a person many minutes of clicking around the open web, such as comparing prices across sites or pulling a figure out of a report. It measures whether a web agent can navigate live pages and return an accurate answer, not just summarize a single page.
Agents browse the live web to answer each question. Answers are scored for accuracy against a gold answer, with partial credit for close numeric or list answers. The headline number is an accuracy score across the task set.
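As a rough illustration of how partial credit for numeric and list answers could work, here is a minimal Python sketch. It is not the benchmark's official scorer: the function names (`numeric_score`, `list_score`, `benchmark_accuracy`), the order-of-magnitude decay for numbers, and the F1 overlap for lists are all assumptions chosen for illustration.

```python
import math


def numeric_score(pred: float, gold: float) -> float:
    """Hypothetical partial credit for numbers.

    Exact matches score 1.0. Otherwise credit decays with the
    order-of-magnitude gap and reaches 0.0 at a 10x difference.
    """
    if pred == gold:
        return 1.0
    if pred <= 0 or gold <= 0:
        # Assumption: non-positive values only earn credit on an exact match.
        return 0.0
    ratio = max(pred, gold) / min(pred, gold)
    return max(0.0, 1.0 - math.log10(ratio))


def list_score(pred: list[str], gold: list[str]) -> float:
    """Hypothetical partial credit for lists: F1 overlap of normalized items."""
    pred_items = {item.strip().lower() for item in pred}
    gold_items = {item.strip().lower() for item in gold}
    hits = len(pred_items & gold_items)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_items)
    recall = hits / len(gold_items)
    return 2 * precision * recall / (precision + recall)


def benchmark_accuracy(scores: list[float]) -> float:
    """Average per-task scores into a headline percentage."""
    return 100.0 * sum(scores) / len(scores) if scores else 0.0
```

Under these assumptions, an agent that returns one of two expected list items earns about two thirds of the credit for that task, and an agent whose numeric answer is off by a factor of ten or more earns none.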
| # | Agent System | Model | Accuracy (%) |
|---|---|---|---|
| 01 | Browser-Use | — | 38.8 |
AssistantBench is very hard. Even strong web agents score well under 50% accuracy, and many land in the teens, because real multi-site research is unforgiving.