Agent Benchmark · 2023

GAIA: General AI Assistants

Name: GAIA: General AI Assistants
Creator: Meta AI, Hugging Face, and AutoGPT
Published: 2023
Keywords: GAIA, AI agent benchmark, Meta AI, Hugging Face, AutoGPT

Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.

Top Score: 74.5

How It Works

GAIA asks an agent the kind of question a capable human assistant should be able to answer: look something up across a few websites, read a file, do a small calculation, and return one exact answer. The questions are easy to check but hard to solve, because they need several real actions chained together rather than one model call.

Each question has a single unambiguous answer that is graded by exact match. Agents are free to browse the web, run code, and open attached files. Scores are reported as the percentage of questions answered correctly, often split by the three difficulty levels.
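The grading described above can be sketched in a few lines. This is a minimal illustration, not GAIA's official scorer: the `normalize` rule (lowercasing and collapsing whitespace), the field names `task_id`, `level`, and `answer`, and the sample questions are all assumptions made for the example.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed rule, not the official one)."""
    return " ".join(answer.strip().lower().split())


def score(predictions: dict[str, str], gold: list[dict]) -> dict[str, float]:
    """Return overall and per-level accuracy as percentages."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in gold:
        level = f"level_{item['level']}"
        predicted = predictions.get(item["task_id"], "")  # missing answer counts as wrong
        hit = normalize(predicted) == normalize(item["answer"])
        for key in ("overall", level):
            total[key] += 1
            correct[key] += int(hit)
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Made-up questions for illustration only; these are not GAIA items.
gold = [
    {"task_id": "q1", "level": 1, "answer": "Paris"},
    {"task_id": "q2", "level": 1, "answer": "42"},
    {"task_id": "q3", "level": 2, "answer": "blue, green"},
    {"task_id": "q4", "level": 3, "answer": "17.5"},
]
predictions = {"q1": " paris ", "q2": "41", "q3": "Blue,  Green", "q4": "17.5"}

result = score(predictions, gold)
print(result)
```

Because a near miss such as "41" for "42" earns nothing, the score reflects only fully correct end-to-end runs, which is what makes the questions easy to check.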

Dataset size: 466 questions across three difficulty levels.
Agent type: Research

Top Agent Systems

#	Agent System	Model	Score (%)
01	HAL Generalist Agent	Claude Sonnet 4.5 (September 2025)	74.5
02	HAL Generalist Agent	Claude Sonnet 4.5 High (September 2025)	70.9
03	HAL Generalist Agent	Claude Opus 4.1 High (August 2025)	68.5
04	HAL Generalist Agent	Claude Opus 4 High (May 2025)	64.8
05	HAL Generalist Agent	Claude Opus 4.1 (August 2025)	64.2
06	HAL Generalist Agent	Claude-3.7 Sonnet High (February 2025)	64.2
07	HF Open Deep Research	GPT-5 Medium (August 2025)	62.8
08	HAL Generalist Agent	GPT-5 Medium (August 2025)	59.4
09	HAL Generalist Agent	o4-mini Low (April 2025)	58.2
10	HF Open Deep Research	Claude Opus 4 (May 2025)	57.6
11	HAL Generalist Agent	Claude Haiku 4.5 (October 2025)	56.4
12	HAL Generalist Agent	Claude-3.7 Sonnet (February 2025)	56.4
13	HF Open Deep Research	o4-mini High (April 2025)	55.8
14	HAL Generalist Agent	o4-mini High (April 2025)	54.5
15	HF Open Deep Research	GPT-4.1 (April 2025)	50.3
16	HAL Generalist Agent	GPT-4.1 (April 2025)	49.7
17	HF Open Deep Research	o4-mini Low (April 2025)	47.9
18	HF Open Deep Research	Claude-3.7 Sonnet (February 2025)	37.0
19	HF Open Deep Research	Claude-3.7 Sonnet High (February 2025)	35.8
20	HF Open Deep Research	o3 Medium (April 2025)	32.7
21	HAL Generalist Agent	Gemini 2.0 Flash (February 2025)	32.7
22	HF Open Deep Research	Claude Sonnet 4.5 High (September 2025)	30.9
23	HF Open Deep Research	Claude Sonnet 4.5 (September 2025)	30.9
24	HAL Generalist Agent	Claude Opus 4 (May 2025)	30.3
25	HAL Generalist Agent	DeepSeek R1 (January 2025)	30.3

Strengths

Answers are exact-match, so scoring is objective and hard to game.
Covers the full assistant loop: search, read, compute, and answer.
Adopted widely, so agent systems are directly comparable.

Limitations

With only a few hundred questions, score differences of a few points can be noise.
Live web pages drift, so old runs are not perfectly reproducible.
Rewards careful tool use more than raw model intelligence.
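The noise in the first limitation can be estimated roughly. The sketch below treats each question as an independent pass/fail trial and uses the normal approximation to the binomial; both are simplifying assumptions, and the function name `accuracy_margin` is made up for this example.

```python
import math


def accuracy_margin(score_pct: float, n_questions: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in percentage points, for an accuracy score."""
    p = score_pct / 100.0
    standard_error = math.sqrt(p * (1.0 - p) / n_questions)
    return 100.0 * z * standard_error


# Top score from the table, evaluated as if on all 466 questions.
margin = accuracy_margin(74.5, 466)
print(f"74.5% on 466 questions: +/- {margin:.1f} points")
```

Under these assumptions the margin is about four points on the full set, and it widens when a run covers only a subset of the questions, so neighbouring rows in the table are not clearly separated.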

Frequently Asked Questions

What does a strong GAIA score look like?

Humans score around 92% on GAIA. The best agent systems in 2026 are in the 60–75% range on the validation set, and most score well below that, which is why GAIA is still a useful separator.

Why is GAIA an agent benchmark and not a model benchmark?

A bare language model cannot browse the web or open a file. GAIA only works when the model is wrapped in a system that can take actions, so it measures the model plus its scaffolding, not the model alone.
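A toy version of such a wrapper shows where the scaffolding sits. Everything here is hypothetical: the `search` and `calculate` tools return canned or trivially computed results, `scripted_model` stands in for a real language model, and the `action: argument` text format is an assumption made for the example, not any particular framework's protocol.

```python
from typing import Callable


# Hypothetical tools; a real scaffold would call a browser, a sandbox, or a file reader.
def search(query: str) -> str:
    return "The tower is 330 metres tall."  # canned result for illustration


def calculate(expression: str) -> str:
    # Toy calculator that only handles "a * b".
    left, right = expression.split("*")
    return str(float(left) * float(right))


TOOLS: dict[str, Callable[[str], str]] = {"search": search, "calculate": calculate}


def scripted_model(history: list[str]) -> str:
    """Stand-in for a language model: replays a fixed plan, one action per step."""
    script = ["search: tower height", "calculate: 330 * 2", "final: 660.0"]
    return script[len(history)]


def run_agent(model: Callable[[list[str]], str], max_steps: int = 5) -> str:
    """Ask the model for an action, run the tool, feed the result back, repeat."""
    history: list[str] = []
    for _ in range(max_steps):
        action, _, argument = model(history).partition(": ")
        if action == "final":
            return argument
        observation = TOOLS[action](argument)
        history.append(f"{action}({argument}) -> {observation}")
    return ""  # step budget exhausted without a final answer


answer = run_agent(scripted_model)
print(answer)
```

Only the string returned by `run_agent` would be graded, so the tool set, the step budget, and the loop itself all influence the score alongside the model.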

Other Agent Benchmarks

Browse the other benchmarks on the leaderboard.

CORE-Bench Hard (Research): Reproduce the results of published research papers from their code and data, on the hardest setting.

ScienceAgentBench (Research): Data-driven scientific discovery tasks that ask an agent to run an analysis and produce a correct result.

SWE-bench Verified (Coding): Real GitHub issues an agent has to fix by editing a codebase until the project test suite passes.

SWE-bench Verified Mini (Coding): A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

