Live website tasks that test whether a browser agent can complete real actions on the open web.
Online-Mind2Web is a live-web version of the Mind2Web task set. Instead of replaying recorded pages, the agent acts on real, current websites to complete goals like booking, searching, or filling forms. It measures whether a browser agent works in the wild, where layouts change and pages are unpredictable.
Agents drive a real browser to complete each task on the live site. Success is judged by whether the goal was actually achieved, often with a mix of automated checks and human verification. The headline number is a task success rate.
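As a rough illustration of how a task success rate can be derived from per-task judgments, here is a minimal sketch. The `TaskResult` type, the task names, and the rounding to one decimal place are assumptions made for this example and are not taken from the benchmark's own tooling.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record of one task's final judgment."""
    task_id: str
    success: bool  # True if the goal was judged to be achieved


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks judged successful, to one decimal place."""
    if not results:
        raise ValueError("cannot compute a success rate over zero tasks")
    passed = sum(r.success for r in results)
    return round(100.0 * passed / len(results), 1)


# Made-up outcomes for three illustrative tasks.
results = [
    TaskResult("book-table", True),
    TaskResult("search-flights", False),
    TaskResult("fill-contact-form", True),
]
print(success_rate(results))  # 66.7
```

Each task counts equally in this sketch, so a single hard site does not weigh more than an easy one.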
| Rank | Agent System | Model | Task success rate (%) |
|---|---|---|---|
| 01 | SeeAct | GPT-5 Medium (August 2025) | 42.3 |
| 02 | Browser-Use | Claude Sonnet 4 (May 2025) | 40.0 |
| 03 | Browser-Use | Claude Sonnet 4 High (May 2025) | 39.3 |
| 04 | Browser-Use | Claude-3.7 Sonnet High (February 2025) | 39.3 |
| 05 | SeeAct | o3 Medium (April 2025) | 39.0 |
| 06 | Browser-Use | Claude-3.7 Sonnet (February 2025) | 38.3 |
| 07 | SeeAct | Claude Sonnet 4 High (May 2025) | 36.7 |
| 08 | SeeAct | Claude Sonnet 4 (May 2025) | 36.7 |
| 09 | Browser-Use | GPT-4.1 (April 2025) | 36.3 |
| 10 | Browser-Use | DeepSeek V3 (March 2025) | 32.3 |
| 11 | SeeAct | o4-mini High (April 2025) | 32.0 |
| 12 | Browser-Use | GPT-5 Medium (August 2025) | 32.0 |
| 13 | SeeAct | o4-mini Low (April 2025) | 31.7 |
| 14 | SeeAct | Claude-3.7 Sonnet High (February 2025) | 30.3 |
| 15 | SeeAct | GPT-4.1 (April 2025) | 30.3 |
| 16 | Browser-Use | o3 Medium (April 2025) | 29.0 |
| 17 | Browser-Use | Gemini 2.0 Flash (February 2025) | 29.0 |
| 18 | SeeAct | Claude-3.7 Sonnet (February 2025) | 28.3 |
| 19 | SeeAct | Gemini 2.0 Flash (February 2025) | 26.7 |
| 20 | Browser-Use | DeepSeek R1 (January 2025) | 25.3 |
| 21 | Browser-Use | o4-mini High (April 2025) | 20.0 |
| 22 | Browser-Use | o4-mini Low (April 2025) | 18.3 |
Saved-page benchmarks let agents overfit to a fixed snapshot. Real websites move buttons, change flows, and add pop-ups, so testing on live sites is the most direct way to check whether an agent will hold up in production.