Agent Benchmark · 2025

Online-Mind2Web

Name: Online-Mind2Web
Creator: Ohio State University
Published: 2025
Keywords: Online-Mind2Web, AI agent benchmark, Ohio State University

Live website tasks that test whether a browser agent can complete real actions on the open web.


Systems Ranked: 22
Top Score: 42.3
Published: 2025
Source: Ohio State University

How It Works

Online-Mind2Web is a live-web version of the Mind2Web task set. Instead of replaying recorded pages, the agent acts on real, current websites to complete goals like booking, searching, or filling forms. It measures whether a browser agent works in the wild, where layouts change and pages are unpredictable.

Agents drive a real browser to complete each task on the live site. Success is judged by whether the goal was actually achieved, often with a mix of automated checks and human verification. The headline number is a task success rate.
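
The headline metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own evaluation code: the `TaskResult` record, its field names, and the example task IDs are assumptions made for the example, and each per-task verdict is assumed to have already been produced by whatever automated or human judging is in use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One judged attempt at a task (hypothetical record, not the benchmark's schema)."""
    task_id: str
    succeeded: bool  # final verdict after automated and/or human checks


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks judged successful, rounded to one decimal."""
    if not results:
        raise ValueError("need at least one judged task")
    passed = sum(1 for r in results if r.succeeded)
    return round(100.0 * passed / len(results), 1)


# Made-up verdicts for three hypothetical tasks.
example = [
    TaskResult("book-flight", True),
    TaskResult("search-product", True),
    TaskResult("fill-contact-form", False),
]
print(success_rate(example))  # 66.7
```

Each task counts once and contributes a plain pass or fail, so the score carries no partial credit for an agent that gets most of the way through a flow.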

Dataset size: Live web tasks across many popular websites.
Agent type: Browser
Published by: Ohio State University
Year: 2025

Top Agent Systems

#	Agent System	Model	Score
01	SeeAct	GPT-5 Medium (August 2025)	42.3
02	Browser-Use	Claude Sonnet 4 (May 2025)	40.0
03	Browser-Use	Claude Sonnet 4 High (May 2025)	39.3
04	Browser-Use	Claude-3.7 Sonnet High (February 2025)	39.3
05	SeeAct	o3 Medium (April 2025)	39.0
06	Browser-Use	Claude-3.7 Sonnet (February 2025)	38.3
07	SeeAct	Claude Sonnet 4 High (May 2025)	36.7
08	SeeAct	Claude Sonnet 4 (May 2025)	36.7
09	Browser-Use	GPT-4.1 (April 2025)	36.3
10	Browser-Use	DeepSeek V3 (March 2025)	32.3
11	Browser-Use	GPT-5 Medium (August 2025)	32.0
12	SeeAct	o4-mini High (April 2025)	32.0
13	SeeAct	o4-mini Low (April 2025)	31.7
14	SeeAct	Claude-3.7 Sonnet High (February 2025)	30.3
15	SeeAct	GPT-4.1 (April 2025)	30.3
16	Browser-Use	o3 Medium (April 2025)	29.0
17	Browser-Use	Gemini 2.0 Flash (February 2025)	29.0
18	SeeAct	Claude-3.7 Sonnet (February 2025)	28.3
19	SeeAct	Gemini 2.0 Flash (February 2025)	26.7
20	Browser-Use	DeepSeek R1 (January 2025)	25.3
21	Browser-Use	o4-mini High (April 2025)	20.0
22	Browser-Use	o4-mini Low (April 2025)	18.3
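
Because the table pairs two agent systems with overlapping models, one way to read it is to compare the systems model by model. The Python sketch below does that with the rows copied from the table above (release dates dropped from the model names). The helper names `scores_by_model` and `framework_gaps` are illustrative, and the gap is only a descriptive difference between two published scores, not a statistical test.

```python
from collections import defaultdict

# (agent system, model, score) rows copied from the leaderboard table.
ROWS = [
    ("SeeAct", "GPT-5 Medium", 42.3),
    ("Browser-Use", "Claude Sonnet 4", 40.0),
    ("Browser-Use", "Claude Sonnet 4 High", 39.3),
    ("Browser-Use", "Claude-3.7 Sonnet High", 39.3),
    ("SeeAct", "o3 Medium", 39.0),
    ("Browser-Use", "Claude-3.7 Sonnet", 38.3),
    ("SeeAct", "Claude Sonnet 4 High", 36.7),
    ("SeeAct", "Claude Sonnet 4", 36.7),
    ("Browser-Use", "GPT-4.1", 36.3),
    ("Browser-Use", "DeepSeek V3", 32.3),
    ("Browser-Use", "GPT-5 Medium", 32.0),
    ("SeeAct", "o4-mini High", 32.0),
    ("SeeAct", "o4-mini Low", 31.7),
    ("SeeAct", "Claude-3.7 Sonnet High", 30.3),
    ("SeeAct", "GPT-4.1", 30.3),
    ("Browser-Use", "o3 Medium", 29.0),
    ("Browser-Use", "Gemini 2.0 Flash", 29.0),
    ("SeeAct", "Claude-3.7 Sonnet", 28.3),
    ("SeeAct", "Gemini 2.0 Flash", 26.7),
    ("Browser-Use", "DeepSeek R1", 25.3),
    ("Browser-Use", "o4-mini High", 20.0),
    ("Browser-Use", "o4-mini Low", 18.3),
]


def scores_by_model(rows):
    """Group scores as model -> {agent system: score}."""
    table = defaultdict(dict)
    for system, model, score in rows:
        table[model][system] = score
    return table


def framework_gaps(rows):
    """For models run under both systems, return SeeAct minus Browser-Use."""
    gaps = {}
    for model, by_system in scores_by_model(rows).items():
        if "SeeAct" in by_system and "Browser-Use" in by_system:
            gaps[model] = round(by_system["SeeAct"] - by_system["Browser-Use"], 1)
    return gaps


if __name__ == "__main__":
    # Print models from most Browser-Use-favoured to most SeeAct-favoured.
    for model, gap in sorted(framework_gaps(ROWS).items(), key=lambda kv: kv[1]):
        print(f"{model:24s} {gap:+.1f}")
```

On these rows the sign of the gap differs by model family: the Claude, GPT-4.1, and Gemini entries score higher under Browser-Use, while GPT-5, o3, and o4-mini score higher under SeeAct. This suggests that the agent system and the model should be read as a pair rather than ranked separately.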

Strengths

Live sites make it a realistic test of deployed browser agents.
Covers a wide spread of everyday website tasks.
Exposes brittleness that recorded-page benchmarks hide.

Limitations

Live sites change, so exact reruns are not possible.
Success judging can require human review.
Anti-bot measures on real sites can block otherwise capable agents.

Frequently Asked Questions

Why test on live sites instead of saved pages?

Saved-page benchmarks let agents overfit to a fixed snapshot. Real websites move buttons, change flows, and add pop-ups, so testing on live sites is the most direct way to check whether an agent will hold up in production.

Other Agent Benchmarks

Browse the other benchmarks on the leaderboard.

Browser

AssistantBench

Time-consuming, realistic web tasks that require browsing many live pages to find one answer.

Research

GAIA

Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.

Coding

SWE-bench Verified

Real GitHub issues an agent has to fix by editing a codebase until the project test suite passes.

Coding

SWE-bench Verified Mini

A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

