Agent Benchmark Leaderboard

The Live Agent Benchmark Leaderboard

How the best AI agent systems stack up across the benchmarks that test real work, not just model trivia. Updated weekly.

Filter by: Model · Framework · Agent Type

16 systems

Overall Agent Ranking

Ranked by average score across every benchmark a system has been tested on. Systems need results on at least three benchmarks to appear here.


#	Agent System	Model	Framework	Agent Type	Benchmarks	Avg. Score
01	HAL Generalist Agent · Claude Opus 4.1 (August 2025)	Claude Opus 4.1 (August 2025)	HAL Generalist Agent	Research	4	49.0
02	HAL Generalist Agent · Claude Sonnet 4.5 (September 2025)	Claude Sonnet 4.5 (September 2025)	HAL Generalist Agent	Research	3	47.3
03	HAL Generalist Agent · Claude Sonnet 4.5 High (September 2025)	Claude Sonnet 4.5 High (September 2025)	HAL Generalist Agent	Research	3	46.6
04	HAL Generalist Agent · Claude Opus 4 High (May 2025)	Claude Opus 4 High (May 2025)	HAL Generalist Agent	Tool Calling	3	46.3
05	HAL Generalist Agent · Claude Opus 4.1 High (August 2025)	Claude Opus 4.1 High (August 2025)	HAL Generalist Agent	Research	4	45.0
06	HAL Generalist Agent · Claude Opus 4 (May 2025)	Claude Opus 4 (May 2025)	HAL Generalist Agent	Tool Calling	3	36.1
07	HAL Generalist Agent · Claude-3.7 Sonnet High (February 2025)	Claude-3.7 Sonnet High (February 2025)	HAL Generalist Agent	Coding	5	34.6
08	HAL Generalist Agent · Claude-3.7 Sonnet (February 2025)	Claude-3.7 Sonnet (February 2025)	HAL Generalist Agent	Coding	5	34.5
09	HAL Generalist Agent · GPT-5 Medium (August 2025)	GPT-5 Medium (August 2025)	HAL Generalist Agent	Research	4	28.1
10	HAL Generalist Agent · o4-mini High (April 2025)	o4-mini High (April 2025)	HAL Generalist Agent	Coding	5	22.3
11	HAL Generalist Agent · o4-mini Low (April 2025)	o4-mini Low (April 2025)	HAL Generalist Agent	Coding	5	21.6
12	HAL Generalist Agent · GPT-4.1 (April 2025)	GPT-4.1 (April 2025)	HAL Generalist Agent	Coding	6	19.5
13	HAL Generalist Agent · o3 Medium (April 2025)	o3 Medium (April 2025)	HAL Generalist Agent	Coding	5	14.8
14	HAL Generalist Agent · DeepSeek V3 (March 2025)	DeepSeek V3 (March 2025)	HAL Generalist Agent	Coding	5	13.3
15	HAL Generalist Agent · Gemini 2.0 Flash (February 2025)	Gemini 2.0 Flash (February 2025)	HAL Generalist Agent	Coding	5	12.2
16	HAL Generalist Agent · DeepSeek R1 (January 2025)	DeepSeek R1 (January 2025)	HAL Generalist Agent	Coding	5	10.2

Per-Benchmark Leaderboards

The top agent systems on each individual benchmark. Filters above apply here too.

GAIA

Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.

#	Agent System	Model	Score
01	HAL Generalist Agent	Claude Sonnet 4.5 (September 2025)	74.5
02	HAL Generalist Agent	Claude Sonnet 4.5 High (September 2025)	70.9
03	HAL Generalist Agent	Claude Opus 4.1 High (August 2025)	68.5
04	HAL Generalist Agent	Claude Opus 4 High (May 2025)	64.8
05	HAL Generalist Agent	Claude Opus 4.1 (August 2025)	64.2
06	HAL Generalist Agent	Claude-3.7 Sonnet High (February 2025)	64.2
07	HF Open Deep Research	GPT-5 Medium (August 2025)	62.8
08	HAL Generalist Agent	GPT-5 Medium (August 2025)	59.4
09	HAL Generalist Agent	o4-mini Low (April 2025)	58.2
10	HF Open Deep Research	Claude Opus 4 (May 2025)	57.6

SWE-bench Verified Mini

A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

#	Agent System	Model	Score
01	SWE-Agent	Claude Sonnet 4.5 High (September 2025)	72.0
02	SWE-Agent	Claude Sonnet 4.5 (September 2025)	68.0
03	SWE-Agent	Claude Opus 4.1 (August 2025)	61.0
04	SWE-Agent	Claude Opus 4.1 High (August 2025)	54.0
05	SWE-Agent	Claude-3.7 Sonnet High (February 2025)	54.0
06	SWE-Agent	o4-mini Low (April 2025)	54.0
07	SWE-Agent	Claude Opus 4 (May 2025)	50.0
08	SWE-Agent	Claude-3.7 Sonnet (February 2025)	50.0
09	SWE-Agent	o4-mini High (April 2025)	50.0
10	SWE-Agent	o3 Medium (April 2025)	46.0

Tau-Bench Airline

A customer-service simulation where an agent has to follow airline policy and use tools to actually resolve a request.

#	Agent System	Model	Score
01	HAL Generalist Agent	Claude-3.7 Sonnet (February 2025)	56.0
02	TAU-bench Tool Calling	o4-mini High (April 2025)	56.0
03	HAL Generalist Agent	Claude Opus 4.1 (August 2025)	54.0
04	TAU-bench Tool Calling	o3 Medium (April 2025)	54.0
05	TAU-bench Tool Calling	Claude Opus 4.1 High (August 2025)	52.0
06	TAU-bench Tool Calling	Claude-3.7 Sonnet High (February 2025)	52.0
07	TAU-bench Tool Calling	Claude Opus 4.1 (August 2025)	50.0
08	TAU-bench Tool Calling	GPT-5 Medium (August 2025)	48.0
09	HAL Generalist Agent	Claude Opus 4 High (May 2025)	44.0
10	HAL Generalist Agent	Claude Opus 4 (May 2025)	44.0

AssistantBench

Time-consuming, realistic web tasks that require browsing many live pages to find one answer.

#	Agent System	Model	Score
01	Browser-Use	—	38.8

CORE-Bench Hard

Reproduce the results of published research papers from their code and data, on the hardest setting.

#	Agent System	Model	Score
01	Claude Code (submitted by Nicholas Carlini)	Claude Opus 4.5	77.8
02	Claude Code (submitted by Nicholas Carlini)	Claude Sonnet 4.5 (September 2025)	62.2
03	CORE-Agent	Claude Opus 4.1 (August 2025)	51.1
04	Claude Code (submitted by Nicholas Carlini)	Claude Sonnet 4 (May 2025)	46.7
05	CORE-Agent	Claude Sonnet 4.5 High (September 2025)	44.4
06	CORE-Agent	Claude Opus 4.5 High (November 2025)	42.2
07	CORE-Agent	Claude Opus 4.1 High (August 2025)	42.2
08	Claude Code (submitted by Nicholas Carlini)	Claude Opus 4.1	42.2
09	CORE-Agent	Claude Opus 4.5 (November 2025)	42.2
10	CORE-Agent	Gemini 3 Pro Preview High (November 2025)	40.0

Online-Mind2Web

Live website tasks that test whether a browser agent can complete real actions on the open web.

#	Agent System	Model	Score
01	SeeAct	GPT-5 Medium (August 2025)	42.3
02	Browser-Use	Claude Sonnet 4 (May 2025)	40.0
03	Browser-Use	Claude Sonnet 4 High (May 2025)	39.3
04	Browser-Use	Claude-3.7 Sonnet High (February 2025)	39.3
05	SeeAct	o3 Medium (April 2025)	39.0
06	Browser-Use	Claude-3.7 Sonnet (February 2025)	38.3
07	SeeAct	Claude Sonnet 4 High (May 2025)	36.7
08	SeeAct	Claude Sonnet 4 (May 2025)	36.7
09	Browser-Use	GPT-4.1 (April 2025)	36.3
10	Browser-Use	DeepSeek V3 (March 2025)	32.3

SciCode

Research-grade scientific coding problems drawn from real physics, biology, and chemistry work.

#	Agent System	Model	Score
01	Scicode Tool Calling Agent	o3 Medium (April 2025)	9.2
02	Scicode Zero Shot Agent	o4-mini Low (April 2025)	9.2
03	Scicode Tool Calling Agent	Claude Opus 4.1 (August 2025)	7.7
04	Scicode Tool Calling Agent	Claude Opus 4.1 High (August 2025)	6.9
05	Scicode Tool Calling Agent	GPT-5 Medium (August 2025)	6.2
06	Scicode Zero Shot Agent	o4-mini High (April 2025)	6.2
07	HAL Generalist Agent	o4-mini Low (April 2025)	6.2
08	Scicode Zero Shot Agent	GPT-4.1 (April 2025)	6.2
09	Scicode Tool Calling Agent	Claude-3.7 Sonnet High (February 2025)	4.6
10	Scicode Tool Calling Agent	o4-mini High (April 2025)	4.6

ScienceAgentBench

Data-driven scientific discovery tasks that ask an agent to run an analysis and produce a correct result.

#	Agent System	Model	Score
01	SAB Self-Debug	—	33.3
02	HAL Generalist Agent	—	21.6

Agent Frameworks Directory

See the scaffolds behind these scores ranked head to head: stars, downloads, capabilities, and trade-offs across LangGraph, CrewAI, Mastra, and more.

Agent Cost Calculator

A strong score means little if the agent costs a fortune to run. Estimate per-task and monthly spend for your workflow with live model prices.

AI Benchmarks Library

Plain-language deep dives on the model benchmarks behind the agents: what each one measures, who tops it, and how they correlate.

What the Agent Leaderboard Measures

An agent benchmark leaderboard ranks complete agent systems (a model plus its framework and agent type) by how well they finish real tasks such as fixing code, browsing the web, and following a support policy, rather than by trivia scores.

A model benchmark asks a single question and grades the answer. An agent benchmark gives a goal and a set of tools, then checks whether the system actually got the job done: did the bug fix pass the tests, did the booking change go through, did the research question get the right number. That is why the same model can land far apart on two different agent setups.

This leaderboard pulls results from the benchmarks that test those skills and ranks each agent system by its average across everything it has been tested on. Use the filters to compare a single model across frameworks, or a single framework across models, and read each benchmark section to see where a system is strong or weak.

Benchmarks: GAIA · SWE-bench · Tau-Bench + more
Filter by: Model · Framework · Agent type
Refresh: Weekly auto-sync
Embed: Top-3 row for citations

Agent Leaderboard FAQ

How Is the Overall Ranking Calculated?

Each system gets the average of its scores across the benchmarks it has been tested on, after every score is put on the same 0 to 100 scale. A system has to appear on at least three benchmarks to be ranked, so one lucky result cannot top the table. Because agents run different benchmark mixes, treat the overall number as a useful summary, not a precise head-to-head.
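As a rough illustration of the rule above, the sketch below averages each system's scores and drops any system with fewer than three benchmarks. It assumes every score is already on the 0 to 100 scale. The function name `overall_ranking`, the tuple layout, and the sample systems are hypothetical and are not the site's actual code or data.

```python
from collections import defaultdict
from statistics import mean

MIN_BENCHMARKS = 3  # threshold stated in the FAQ answer above


def overall_ranking(results, min_benchmarks=MIN_BENCHMARKS):
    """Rank systems by their mean score across benchmarks.

    results: iterable of (system, benchmark, score) tuples, where each
    score is assumed to be already normalized to the 0-100 scale.
    Returns a list of (system, benchmark_count, average) sorted best first.
    """
    by_system = defaultdict(dict)
    for system, benchmark, score in results:
        by_system[system][benchmark] = score

    ranked = [
        (system, len(scores), round(mean(scores.values()), 1))
        for system, scores in by_system.items()
        if len(scores) >= min_benchmarks  # too few results: not ranked
    ]
    ranked.sort(key=lambda row: row[2], reverse=True)
    return ranked


# Hypothetical sample data, not taken from the leaderboard.
sample = [
    ("system-a", "bench-1", 60.0),
    ("system-a", "bench-2", 50.0),
    ("system-a", "bench-3", 40.0),
    ("system-b", "bench-1", 90.0),  # only two results, so it is excluded
    ("system-b", "bench-2", 80.0),
    ("system-c", "bench-1", 30.0),
    ("system-c", "bench-2", 20.0),
    ("system-c", "bench-3", 10.0),
    ("system-c", "bench-4", 40.0),
]

print(overall_ranking(sample))
# [('system-a', 3, 50.0), ('system-c', 4, 25.0)]
```

Note that `system-b` has the highest individual scores but is left out because it appears on only two benchmarks, which is the "one lucky result" case the threshold guards against.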

Why Do You Rank Agent Systems Instead of Models?

A bare model cannot browse the web, run tests, or call tools. Real agent performance comes from the model plus the framework and agent loop around it. Two systems on the same model often score very differently, so we track the whole system and let you filter down to the model or framework you care about.
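To make the same-model comparison concrete, the sketch below groups a few rows copied from the per-benchmark tables on this page by benchmark and model, then reports the gap between the best and worst agent system for each pair. The function name `framework_spread` and the row layout are assumptions made for illustration.

```python
from collections import defaultdict

# (benchmark, agent system, model, score) rows copied from the tables on this page.
ROWS = [
    ("GAIA", "HF Open Deep Research", "GPT-5 Medium (August 2025)", 62.8),
    ("GAIA", "HAL Generalist Agent", "GPT-5 Medium (August 2025)", 59.4),
    ("Tau-Bench Airline", "HAL Generalist Agent", "Claude Opus 4.1 (August 2025)", 54.0),
    ("Tau-Bench Airline", "TAU-bench Tool Calling", "Claude Opus 4.1 (August 2025)", 50.0),
    ("SciCode", "Scicode Zero Shot Agent", "o4-mini Low (April 2025)", 9.2),
    ("SciCode", "HAL Generalist Agent", "o4-mini Low (April 2025)", 6.2),
]


def framework_spread(rows):
    """Return {(benchmark, model): max score - min score} for every
    benchmark/model pair that was run under at least two agent systems."""
    grouped = defaultdict(list)
    for benchmark, system, model, score in rows:
        grouped[(benchmark, model)].append(score)

    return {
        key: round(max(scores) - min(scores), 1)
        for key, scores in grouped.items()
        if len(scores) >= 2
    }


for (benchmark, model), gap in framework_spread(ROWS).items():
    print(f"{benchmark} | {model}: {gap} points between agent systems")
```

The same grouping is what the Model filter does visually: hold the model fixed and read down the Agent System column.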

How Often Does the Leaderboard Update?

A sync job runs weekly and pulls fresh results from the public benchmark feeds we track. Scores entered by hand for benchmarks without a clean data feed stay put between syncs.
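One way to picture that behavior is the sketch below, which refreshes feed-sourced entries and leaves hand-entered ones alone. The storage layout, the `source` field, and the function name `weekly_sync` are hypothetical; the page does not describe how the sync job is actually implemented.

```python
def weekly_sync(stored, feed_scores):
    """Merge fresh feed results into the stored leaderboard entries.

    stored: dict mapping (benchmark, system) to
            {"score": float, "source": "feed" or "manual"}
    feed_scores: dict mapping (benchmark, system) to a fresh score
    Returns a new dict; the inputs are not modified.
    """
    updated = {key: dict(entry) for key, entry in stored.items()}
    for key, score in feed_scores.items():
        existing = updated.get(key)
        if existing is not None and existing["source"] == "manual":
            continue  # hand-entered scores stay put between syncs
        updated[key] = {"score": score, "source": "feed"}
    return updated


# Hypothetical entries for illustration only.
stored = {
    ("bench-1", "system-a"): {"score": 40.0, "source": "feed"},
    ("bench-2", "system-a"): {"score": 55.0, "source": "manual"},
}
feed = {
    ("bench-1", "system-a"): 42.5,  # refreshed from the feed
    ("bench-2", "system-a"): 10.0,  # ignored: this entry was entered by hand
    ("bench-1", "system-b"): 30.0,  # new result picked up by the sync
}

synced = weekly_sync(stored, feed)
print(synced[("bench-1", "system-a")]["score"])  # 42.5
print(synced[("bench-2", "system-a")]["score"])  # 55.0
```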

Can I Put This Leaderboard on My Own Site?

Yes. The embed shows the top three agent systems in a compact view with a link back to the full page, which makes it easy to cite in a blog post or report.

Which Benchmarks Are Included?

The first release covers the benchmarks with a reliable machine-readable feed: GAIA, SWE-bench Verified and its mini subset, Tau-Bench Airline, AssistantBench, CORE-Bench Hard, Online-Mind2Web, SciCode, ScienceAgentBench, and USACO. More are added as clean data sources become available.

USACO

Competitive programming problems that demand real algorithmic reasoning, not just boilerplate code.

#	Agent System	Model	Score
01	USACO Episodic + Semantic	GPT-5 Medium (August 2025)	69.1
02	USACO Episodic + Semantic	o4-mini High (April 2025)	64.8
03	USACO Episodic + Semantic	o4-mini Low (April 2025)	53.1
04	USACO Episodic + Semantic	Claude Opus 4.1 High (August 2025)	51.5
05	USACO Episodic + Semantic	Claude Opus 4.1 (August 2025)	48.2
06	USACO Episodic + Semantic	o3 Medium (April 2025)	46.3
07	USACO Episodic + Semantic	GPT-4.1 (April 2025)	45.0
08	USACO Episodic + Semantic	DeepSeek V3 (March 2025)	39.1
09	USACO Episodic + Semantic	DeepSeek R1 (January 2025)	38.1
10	USACO Episodic + Semantic	Claude-3.7 Sonnet (February 2025)	29.3
