How the best AI agent systems stack up across the benchmarks that test real work, not just model trivia. Updated weekly.
Ranked by average score across every benchmark a system has been tested on. Systems need results on at least three benchmarks to appear here.
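| # | System | Model | Framework | Agent Type | Benchmarks | Avg. Score |
|---|---|---|---|---|---|---|
| 01 | HAL Generalist Agent · Claude Opus 4.1 (August 2025) | Claude Opus 4.1 (August 2025) | HAL Generalist Agent | Research | 4 | 49.0 |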
| 02 | HAL Generalist Agent · Claude Sonnet 4.5 (September 2025) | Claude Sonnet 4.5 (September 2025) | HAL Generalist Agent | Research | 3 | 47.3 |
| 03 | HAL Generalist Agent · Claude Sonnet 4.5 High (September 2025) | Claude Sonnet 4.5 High (September 2025) | HAL Generalist Agent | Research | 3 | 46.6 |
| 04 | HAL Generalist Agent · Claude Opus 4 High (May 2025) | Claude Opus 4 High (May 2025) | HAL Generalist Agent | Tool Calling | 3 | 46.3 |
| 05 | HAL Generalist Agent · Claude Opus 4.1 High (August 2025) | Claude Opus 4.1 High (August 2025) | HAL Generalist Agent | Research | 4 | 45.0 |
| 06 | HAL Generalist Agent · Claude Opus 4 (May 2025) | Claude Opus 4 (May 2025) | HAL Generalist Agent | Tool Calling | 3 | 36.1 |
| 07 | HAL Generalist Agent · Claude-3.7 Sonnet High (February 2025) | Claude-3.7 Sonnet High (February 2025) | HAL Generalist Agent | Coding | 5 | 34.6 |
| 08 | HAL Generalist Agent · Claude-3.7 Sonnet (February 2025) | Claude-3.7 Sonnet (February 2025) | HAL Generalist Agent | Coding | 5 | 34.5 |
| 09 | HAL Generalist Agent · GPT-5 Medium (August 2025) | GPT-5 Medium (August 2025) | HAL Generalist Agent | Research | 4 | 28.1 |
| 10 | HAL Generalist Agent · o4-mini High (April 2025) | o4-mini High (April 2025) | HAL Generalist Agent | Coding | 5 | 22.3 |
| 11 | HAL Generalist Agent · o4-mini Low (April 2025) | o4-mini Low (April 2025) | HAL Generalist Agent | Coding | 5 | 21.6 |
| 12 | HAL Generalist Agent · GPT-4.1 (April 2025) | GPT-4.1 (April 2025) | HAL Generalist Agent | Coding | 6 | 19.5 |
| 13 | HAL Generalist Agent · o3 Medium (April 2025) | o3 Medium (April 2025) | HAL Generalist Agent | Coding | 5 | 14.8 |
| 14 | HAL Generalist Agent · DeepSeek V3 (March 2025) | DeepSeek V3 (March 2025) | HAL Generalist Agent | Coding | 5 | 13.3 |
| 15 | HAL Generalist Agent · Gemini 2.0 Flash (February 2025) | Gemini 2.0 Flash (February 2025) | HAL Generalist Agent | Coding | 5 | 12.2 |
| 16 | HAL Generalist Agent · DeepSeek R1 (January 2025) | DeepSeek R1 (January 2025) | HAL Generalist Agent | Coding | 5 | 10.2 |
The top agent systems on each individual benchmark. Filters above apply here too.
GAIA: real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | HAL Generalist Agent | Claude Sonnet 4.5 (September 2025) | 74.5 |
| 02 | HAL Generalist Agent | Claude Sonnet 4.5 High (September 2025) | 70.9 |
| 03 | HAL Generalist Agent | Claude Opus 4.1 High (August 2025) | 68.5 |
| 04 | HAL Generalist Agent | Claude Opus 4 High (May 2025) | 64.8 |
| 05 | HAL Generalist Agent | Claude Opus 4.1 (August 2025) | 64.2 |
| 06 | HAL Generalist Agent | Claude-3.7 Sonnet High (February 2025) | 64.2 |
| 07 | HF Open Deep Research | GPT-5 Medium (August 2025) | 62.8 |
| 08 | HAL Generalist Agent | GPT-5 Medium (August 2025) | 59.4 |
| 09 | HAL Generalist Agent | o4-mini Low (April 2025) | 58.2 |
| 10 | HF Open Deep Research | Claude Opus 4 (May 2025) | 57.6 |
SWE-bench Verified Mini: a smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | SWE-Agent | Claude Sonnet 4.5 High (September 2025) | 72.0 |
| 02 | SWE-Agent | Claude Sonnet 4.5 (September 2025) | 68.0 |
| 03 | SWE-Agent | Claude Opus 4.1 (August 2025) | 61.0 |
| 04 | SWE-Agent | Claude Opus 4.1 High (August 2025) | 54.0 |
| 05 | SWE-Agent | Claude-3.7 Sonnet High (February 2025) | 54.0 |
| 06 | SWE-Agent | o4-mini Low (April 2025) | 54.0 |
| 07 | SWE-Agent | Claude Opus 4 (May 2025) | 50.0 |
| 08 | SWE-Agent | Claude-3.7 Sonnet (February 2025) | 50.0 |
| 09 | SWE-Agent | o4-mini High (April 2025) | 50.0 |
| 10 | SWE-Agent | o3 Medium (April 2025) | 46.0 |
Tau-Bench Airline: a customer-service simulation where an agent has to follow airline policy and use tools to actually resolve a request.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | HAL Generalist Agent | Claude-3.7 Sonnet (February 2025) | 56.0 |
| 02 | TAU-bench Tool Calling | o4-mini High (April 2025) | 56.0 |
| 03 | HAL Generalist Agent | Claude Opus 4.1 (August 2025) | 54.0 |
| 04 | TAU-bench Tool Calling | o3 Medium (April 2025) | 54.0 |
| 05 | TAU-bench Tool Calling | Claude Opus 4.1 High (August 2025) | 52.0 |
| 06 | TAU-bench Tool Calling | Claude-3.7 Sonnet High (February 2025) | 52.0 |
| 07 | TAU-bench Tool Calling | Claude Opus 4.1 (August 2025) | 50.0 |
| 08 | TAU-bench Tool Calling | GPT-5 Medium (August 2025) | 48.0 |
| 09 | HAL Generalist Agent | Claude Opus 4 High (May 2025) | 44.0 |
| 10 | HAL Generalist Agent | Claude Opus 4 (May 2025) | 44.0 |
AssistantBench: time-consuming, realistic web tasks that require browsing many live pages to find one answer.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | Browser-Use | — | 38.8 |
CORE-Bench Hard: reproduce the results of published research papers from their code and data, on the hardest setting.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | Claude Code (submitted by Nicholas Carlini) | Claude Opus 4.5 | 77.8 |
| 02 | Claude Code (submitted by Nicholas Carlini) | Claude Sonnet 4.5 (September 2025) | 62.2 |
| 03 | CORE-Agent | Claude Opus 4.1 (August 2025) | 51.1 |
| 04 | Claude Code (submitted by Nicholas Carlini) | Claude Sonnet 4 (May 2025) | 46.7 |
| 05 | CORE-Agent | Claude Sonnet 4.5 High (September 2025) | 44.4 |
| 06 | CORE-Agent | Claude Opus 4.5 High (November 2025) | 42.2 |
| 07 | CORE-Agent | Claude Opus 4.1 High (August 2025) | 42.2 |
| 08 | Claude Code (submitted by Nicholas Carlini) | Claude Opus 4.1 | 42.2 |
| 09 | CORE-Agent | Claude Opus 4.5 (November 2025) | 42.2 |
| 10 | CORE-Agent | Gemini 3 Pro Preview High (November 2025) | 40.0 |
Online-Mind2Web: live website tasks that test whether a browser agent can complete real actions on the open web.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | SeeAct | GPT-5 Medium (August 2025) | 42.3 |
| 02 | Browser-Use | Claude Sonnet 4 (May 2025) | 40.0 |
| 03 | Browser-Use | Claude Sonnet 4 High (May 2025) | 39.3 |
| 04 | Browser-Use | Claude-3.7 Sonnet High (February 2025) | 39.3 |
| 05 | SeeAct | o3 Medium (April 2025) | 39.0 |
| 06 | Browser-Use | Claude-3.7 Sonnet (February 2025) | 38.3 |
| 07 | SeeAct | Claude Sonnet 4 High (May 2025) | 36.7 |
| 08 | SeeAct | Claude Sonnet 4 (May 2025) | 36.7 |
| 09 | Browser-Use | GPT-4.1 (April 2025) | 36.3 |
| 10 | Browser-Use | DeepSeek V3 (March 2025) | 32.3 |
SciCode: research-grade scientific coding problems drawn from real physics, biology, and chemistry work.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | Scicode Tool Calling Agent | o3 Medium (April 2025) | 9.2 |
| 02 | Scicode Zero Shot Agent | o4-mini Low (April 2025) | 9.2 |
| 03 | Scicode Tool Calling Agent | Claude Opus 4.1 (August 2025) | 7.7 |
| 04 | Scicode Tool Calling Agent | Claude Opus 4.1 High (August 2025) | 6.9 |
| 05 | Scicode Tool Calling Agent | GPT-5 Medium (August 2025) | 6.2 |
| 06 | Scicode Zero Shot Agent | o4-mini High (April 2025) | 6.2 |
| 07 | HAL Generalist Agent | o4-mini Low (April 2025) | 6.2 |
| 08 | Scicode Zero Shot Agent | GPT-4.1 (April 2025) | 6.2 |
| 09 | Scicode Tool Calling Agent | Claude-3.7 Sonnet High (February 2025) | 4.6 |
| 10 | Scicode Tool Calling Agent | o4-mini High (April 2025) | 4.6 |
ScienceAgentBench: data-driven scientific discovery tasks that ask an agent to run an analysis and produce a correct result.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | SAB Self-Debug | — | 33.3 |
| 02 | HAL Generalist Agent | — | 21.6 |
USACO: competitive programming problems that demand real algorithmic reasoning, not just boilerplate code.
| # | Agent System | Model | Score |
|---|---|---|---|
| 01 | USACO Episodic + Semantic | GPT-5 Medium (August 2025) | 69.1 |
| 02 | USACO Episodic + Semantic | o4-mini High (April 2025) | 64.8 |
| 03 | USACO Episodic + Semantic | o4-mini Low (April 2025) | 53.1 |
| 04 | USACO Episodic + Semantic | Claude Opus 4.1 High (August 2025) | 51.5 |
| 05 | USACO Episodic + Semantic | Claude Opus 4.1 (August 2025) | 48.2 |
| 06 | USACO Episodic + Semantic | o3 Medium (April 2025) | 46.3 |
| 07 | USACO Episodic + Semantic | GPT-4.1 (April 2025) | 45.0 |
| 08 | USACO Episodic + Semantic | DeepSeek V3 (March 2025) | 39.1 |
| 09 | USACO Episodic + Semantic | DeepSeek R1 (January 2025) | 38.1 |
| 10 | USACO Episodic + Semantic | Claude-3.7 Sonnet (February 2025) | 29.3 |

An agent benchmark leaderboard ranks complete agent systems (a model plus its framework and agent type) by how well they finish real tasks such as fixing code, browsing the web, and following a support policy, rather than by trivia scores.
A model benchmark asks a single question and grades the answer. An agent benchmark gives a goal and a set of tools, then checks whether the system actually got the job done: did the bug fix pass the tests, did the booking change go through, did the research question get the right number? Because the outcome depends on the whole system and not only the model, the same model can land far apart on two different agent setups.
This leaderboard pulls results from the benchmarks that test those skills and ranks each agent system by its average across everything it has been tested on. Use the filters to compare a single model across frameworks, or a single framework across models, and read each benchmark section to see where a system is strong or weak.
Each system gets the average of its scores across the benchmarks it has been tested on, after every score is put on the same 0 to 100 scale. A system has to appear on at least three benchmarks to be ranked, so one lucky result cannot top the table. Because agents run different benchmark mixes, treat the overall number as a useful summary, not a precise head-to-head.
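
The averaging rule above can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual pipeline: the `Result` type, the `max_score` field, the linear rescaling, and the sample numbers are all assumptions made up for the example. Only the three-benchmark minimum and the 0 to 100 scale come from the text.

```python
from dataclasses import dataclass
from statistics import mean

MIN_BENCHMARKS = 3  # threshold stated in the text


@dataclass(frozen=True)
class Result:
    system: str               # framework plus model, e.g. "Agent A / Model X"
    benchmark: str
    score: float              # raw score as reported
    max_score: float = 100.0  # assumed ceiling of the raw scale


def normalize(result: Result) -> float:
    """Rescale a raw score to 0-100 (linear rescaling is an assumption)."""
    return 100.0 * result.score / result.max_score


def rank(results: list[Result]) -> list[tuple[str, int, float]]:
    """Return (system, benchmark count, average) rows, best average first."""
    by_system: dict[str, dict[str, float]] = {}
    for r in results:
        by_system.setdefault(r.system, {})[r.benchmark] = normalize(r)
    rows = [
        (system, len(scores), round(mean(scores.values()), 1))
        for system, scores in by_system.items()
        if len(scores) >= MIN_BENCHMARKS  # drop systems with too few results
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical sample data, not taken from the leaderboard.
SAMPLE = [
    Result("System A", "bench-1", 60.0),
    Result("System A", "bench-2", 50.0),
    Result("System A", "bench-3", 40.0),
    Result("System B", "bench-1", 0.9, max_score=1.0),  # one result only
    Result("System C", "bench-1", 30.0),
    Result("System C", "bench-2", 0.5, max_score=1.0),
    Result("System C", "bench-3", 40.0),
]

for system, count, average in rank(SAMPLE):
    print(f"{system}: {average} over {count} benchmarks")
```

In this sample, System B has the highest single score but is left out because it has only one result, which is the "one lucky result" case the threshold is meant to block.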
A bare model cannot browse the web, run tests, or call tools. Real agent performance comes from the model plus the framework and agent loop around it. Two systems on the same model often score very differently, so we track the whole system and let you filter down to the model or framework you care about.
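
To make the model-versus-system point concrete, the sketch below compares one model across two frameworks on the same benchmark. The scores are copied from the per-benchmark tables on this page. The benchmark labels are matched to those tables by their descriptions, which is an assumption, and the helper names `compare_frameworks` and `spread` are invented for the example.

```python
Row = tuple[str, str, str, float]  # benchmark, framework, model, score

# Scores copied from the per-benchmark tables; labels are assumed matches.
ROWS: list[Row] = [
    ("GAIA", "HAL Generalist Agent", "GPT-5 Medium (August 2025)", 59.4),
    ("GAIA", "HF Open Deep Research", "GPT-5 Medium (August 2025)", 62.8),
    ("Tau-Bench Airline", "HAL Generalist Agent", "Claude Opus 4.1 (August 2025)", 54.0),
    ("Tau-Bench Airline", "TAU-bench Tool Calling", "Claude Opus 4.1 (August 2025)", 50.0),
    ("Online-Mind2Web", "Browser-Use", "Claude Sonnet 4 (May 2025)", 40.0),
    ("Online-Mind2Web", "SeeAct", "Claude Sonnet 4 (May 2025)", 36.7),
]


def compare_frameworks(rows: list[Row], benchmark: str, model: str) -> list[tuple[str, float]]:
    """List (framework, score) pairs for one model on one benchmark, best first."""
    matches = [(fw, score) for b, fw, m, score in rows if b == benchmark and m == model]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


def spread(pairs: list[tuple[str, float]]) -> float:
    """Gap in points between the best and worst framework for the same model."""
    scores = [score for _, score in pairs]
    return round(max(scores) - min(scores), 1)


# Walk each (benchmark, model) pair once, in first-seen order.
for benchmark, model in dict.fromkeys((b, m) for b, _, m, _ in ROWS):
    pairs = compare_frameworks(ROWS, benchmark, model)
    print(benchmark, model, pairs, "spread:", spread(pairs))
```

Even in this small slice, the same model moves by three to four points depending on the framework wrapped around it.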
A sync job runs weekly and pulls fresh results from the public benchmark feeds we track. Scores entered by hand for benchmarks without a clean data feed stay put between syncs.
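
One way a sync like this could keep hand-entered scores in place is sketched below. Everything here is hypothetical: the `Entry` type, the `source` field, and the rule that a manual entry wins over a feed entry with the same key are assumptions for illustration, not a description of the real job.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    benchmark: str
    system: str
    score: float
    source: str  # "feed" or "manual" (hypothetical field)


def sync(stored: list[Entry], fresh: list[Entry]) -> list[Entry]:
    """Replace feed-sourced entries with fresh feed data; keep manual entries as they are."""
    merged = {(e.benchmark, e.system): e for e in fresh}
    for e in stored:
        if e.source == "manual":
            # Assumption: a hand-entered score is never overwritten by the feed.
            merged[(e.benchmark, e.system)] = e
    return sorted(merged.values(), key=lambda e: (e.benchmark, e.system))


# Hypothetical data: one feed-backed benchmark and one hand-entered benchmark.
stored = [
    Entry("bench-a", "sys-1", 40.0, "feed"),
    Entry("bench-b", "sys-1", 33.3, "manual"),
]
fresh = [
    Entry("bench-a", "sys-1", 42.0, "feed"),  # updated score
    Entry("bench-a", "sys-2", 10.0, "feed"),  # newly published result
]

for entry in sync(stored, fresh):
    print(entry)
```

Stale feed entries are dropped because the merged set starts from the fresh feed alone, while manual entries are carried over unchanged.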
The leaderboard can be embedded. The embed shows the top three agent systems in a compact view with a link back to the full page, which makes it easy to cite in a blog post or report.
The first release covers the benchmarks with a reliable machine-readable feed: GAIA, SWE-bench Verified and its mini subset, Tau-Bench Airline, AssistantBench, CORE-Bench Hard, Online-Mind2Web, SciCode, ScienceAgentBench, and USACO. More are added as clean data sources become available.