Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.
GAIA asks an agent the kind of question a capable human assistant should be able to answer: look something up across a few websites, read a file, do a small calculation, and return one exact answer. The questions are easy to check but hard to solve, because they need several real actions chained together rather than one model call.
Each question has a single unambiguous answer that is graded by exact match. Agents are free to browse the web, run code, and open attached files. Scores are reported as the percentage of questions answered correctly, often split by the three difficulty levels.
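The scoring described above can be sketched in a few lines. This is an illustration only, not the official GAIA scorer: the `normalize` step (lowercasing and collapsing whitespace) and the `(level, predicted, expected)` record layout are assumptions made for this example.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Assumed normalization: trim, lowercase, and collapse runs of whitespace."""
    return " ".join(answer.strip().lower().split())


def score_by_level(results):
    """Compute percent correct per difficulty level and overall.

    `results` is an iterable of (level, predicted, expected) tuples.
    Returns a dict mapping each level to a percentage, plus an "overall" key.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for level, predicted, expected in results:
        total[level] += 1
        # A question counts only if the normalized strings match exactly.
        if normalize(predicted) == normalize(expected):
            correct[level] += 1

    if not total:
        return {"overall": 0.0}

    scores = {level: 100.0 * correct[level] / total[level] for level in total}
    scores["overall"] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores


if __name__ == "__main__":
    sample = [
        (1, "Paris", " paris "),
        (1, "42", "43"),
        (2, "a b", "A  B"),
        (3, "x", "y"),
    ]
    print(score_by_level(sample))
```

Because grading is a plain string comparison, an answer that is nearly right still scores zero, which is what makes the benchmark easy to check.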
| # | Agent System | Model | Score (%) |
|---|---|---|---|
| 01 | HAL Generalist Agent | Claude Sonnet 4.5 (September 2025) | 74.5 |
| 02 | HAL Generalist Agent | Claude Sonnet 4.5 High (September 2025) | 70.9 |
| 03 | HAL Generalist Agent | Claude Opus 4.1 High (August 2025) | 68.5 |
| 04 | HAL Generalist Agent | Claude Opus 4 High (May 2025) | 64.8 |
| 05 | HAL Generalist Agent | Claude Opus 4.1 (August 2025) | 64.2 |
| 06 | HAL Generalist Agent | Claude-3.7 Sonnet High (February 2025) | 64.2 |
| 07 | HF Open Deep Research | GPT-5 Medium (August 2025) | 62.8 |
| 08 | HAL Generalist Agent | GPT-5 Medium (August 2025) | 59.4 |
| 09 | HAL Generalist Agent | o4-mini Low (April 2025) | 58.2 |
| 10 | HF Open Deep Research | Claude Opus 4 (May 2025) | 57.6 |
| 11 | HAL Generalist Agent | Claude Haiku 4.5 (October 2025) | 56.4 |
| 12 | HAL Generalist Agent | Claude-3.7 Sonnet (February 2025) | 56.4 |
| 13 | HF Open Deep Research | o4-mini High (April 2025) | 55.8 |
| 14 | HAL Generalist Agent | o4-mini High (April 2025) | 54.5 |
| 15 | HF Open Deep Research | GPT-4.1 (April 2025) | 50.3 |
| 16 | HAL Generalist Agent | GPT-4.1 (April 2025) | 49.7 |
| 17 | HF Open Deep Research | o4-mini Low (April 2025) | 47.9 |
| 18 | HF Open Deep Research | Claude-3.7 Sonnet (February 2025) | 37.0 |
| 19 | HF Open Deep Research | Claude-3.7 Sonnet High (February 2025) | 35.8 |
| 20 | HF Open Deep Research | o3 Medium (April 2025) | 32.7 |
| 21 | HAL Generalist Agent | Gemini 2.0 Flash (February 2025) | 32.7 |
| 22 | HF Open Deep Research | Claude Sonnet 4.5 High (September 2025) | 30.9 |
| 23 | HF Open Deep Research | Claude Sonnet 4.5 (September 2025) | 30.9 |
| 24 | HAL Generalist Agent | Claude Opus 4 (May 2025) | 30.3 |
| 25 | HAL Generalist Agent | DeepSeek R1 (January 2025) | 30.3 |
Humans score around 92% on GAIA. As of 2026, the best agent systems reach 60–75% on the validation set and most score well below that, so GAIA still separates strong systems from weak ones.
A bare language model cannot browse the web or open a file. GAIA only works when the model is wrapped in a system that can take actions, so it measures the model plus its scaffolding, not the model alone.
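To make the "model plus scaffolding" point concrete, here is a minimal sketch of an action loop. Everything in it is hypothetical: `run_agent`, the `lookup` and `multiply` tools, and `scripted_policy` (a stand-in for a real model call) are invented for this example and do not correspond to any agent system listed above.

```python
from typing import Callable, Dict, List, Optional, Tuple

Step = Tuple[str, str, Optional[str]]  # (action, argument, observation)

# Hypothetical stand-in for a web lookup: a tiny fixed fact table.
FACTS = {"widgets_per_box": "12"}


def lookup(key: str) -> Optional[str]:
    """Return a stored fact, or None if the key is unknown."""
    return FACTS.get(key)


def multiply(argument: str) -> str:
    """Multiply two whitespace-separated integers, e.g. '12 3'."""
    left, right = argument.split()
    return str(int(left) * int(right))


TOOLS: Dict[str, Callable[[str], Optional[str]]] = {
    "lookup": lookup,
    "multiply": multiply,
}


def scripted_policy(question: str, history: List[Step]) -> Tuple[str, str]:
    """Stand-in for the model: picks the next action from what has happened so far."""
    if not history:
        return ("lookup", "widgets_per_box")
    if len(history) == 1:
        return ("multiply", f"{history[0][2]} 3")
    return ("final_answer", str(history[-1][2]))


def run_agent(question, policy, tools, max_steps: int = 5) -> Optional[str]:
    """Alternate between asking the policy for an action and executing it."""
    history: List[Step] = []
    for _ in range(max_steps):
        action, argument = policy(question, history)
        if action == "final_answer":
            return argument
        observation = tools[action](argument)
        history.append((action, argument, observation))
    return None  # step budget exhausted without an answer


if __name__ == "__main__":
    print(run_agent("How many widgets are in 3 boxes?", scripted_policy, TOOLS))
```

The step budget, tool set, and history format are all part of the scaffold, so changing any of them can change the score even when the underlying model stays the same.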