HLE: Humanity's Last Exam

Name: HLE: Humanity's Last Exam
Creator: Center for AI Safety and Scale AI
Published: 2025
Keywords: HLE, AI benchmark, text model evaluation, Center for AI Safety and Scale AI

Twenty-five hundred expert-written questions spanning every academic field, designed to be unsolvable by the frontier AI systems available at release.

Models Tested

Top Score

52.1

Published

2025

Source

Center for AI Safety and Scale AI

How It Works

HLE is the hardest broad-coverage benchmark in public use. Its questions were crowdsourced from about a thousand subject experts and filtered so that the frontier models available at release failed them. About 14% are multimodal and require image understanding. HLE measures how close a model is to the ceiling of human expert knowledge, and how much further the field still has to go.

Questions are short-answer or multiple choice. Short-answer items are scored by exact match against the reference answer, and multiple-choice items by accuracy. About 14% of questions include an image or diagram, so a fair score requires a multimodal model.
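The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the official grader: the `Item` fields, the `kind` labels, and the normalization (trim, lowercase, collapse whitespace) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    kind: str        # hypothetical label: "short_answer" or "multiple_choice"
    reference: str   # gold answer (a string, or a choice letter)
    prediction: str  # raw model output


def normalize(text: str) -> str:
    # Assumed normalization: trim, lowercase, collapse runs of whitespace.
    return " ".join(text.strip().lower().split())


def is_correct(item: Item) -> bool:
    # Exact match after normalization; for multiple choice this compares letters.
    return normalize(item.prediction) == normalize(item.reference)


def score(items: list[Item]) -> float:
    # Percentage of items answered correctly.
    if not items:
        raise ValueError("no items to score")
    return 100.0 * sum(is_correct(i) for i in items) / len(items)


items = [
    Item("short_answer", "42", " 42 "),
    Item("multiple_choice", "B", "b"),
    Item("short_answer", "Paris", "London"),
    Item("multiple_choice", "C", "A"),
]
print(f"score: {score(items):.1f}")
```

Strict string matching is unforgiving of equivalent answers written differently (for example "1/2" versus "0.5"), which is one reason a careful evaluation setup matters when comparing reported numbers.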

Dataset size

2,500 questions across mathematics, physics, biology, chemistry, computer science, engineering, the humanities, and the social sciences.

Mean score

29.2

Median score

28.7

Open / Closed

20 / 7

Top Scorers

#	Model	Lab	Source	Score
01	GPT-5.4	OpenAI	Closed	52.1
02	Kimi K2.5	Moonshot AI	Open	50.2
03	DeepSeek-V3.2	DeepSeek	Open	40.8
04	Claude Opus 4.6	Anthropic	Closed	40.0
05	DeepSeek-V4-Pro	DeepSeek	Open	37.7
06	Gemini 3 Pro	Google	Closed	37.5
07	GPT-5.2	OpenAI

Score Distribution

Open vs Closed Source

Gap on HLE: +1.9 pts (closed leads)

Top Open-Source Models

1	Kimi K2.5	50.2
2	DeepSeek-V3.2	40.8
3	DeepSeek-V4-Pro	37.7

Top Closed-Source Models

1	GPT-5.4	52.1
2	Claude Opus 4.6	40.0
3	Gemini 3 Pro	37.5
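The page does not say how the open-versus-closed gap is computed. The reported +1.9 pts matches the difference between the best closed score and the best open score, so the sketch below assumes that definition. The scores are copied from the two lists above; the function name is hypothetical.

```python
# Scores copied from the top open-source and top closed-source lists.
top_open = {"Kimi K2.5": 50.2, "DeepSeek-V3.2": 40.8, "DeepSeek-V4-Pro": 37.7}
top_closed = {"GPT-5.4": 52.1, "Claude Opus 4.6": 40.0, "Gemini 3 Pro": 37.5}


def leader_gap(closed: dict[str, float], open_: dict[str, float]) -> float:
    # Assumed definition: best closed score minus best open score,
    # rounded to one decimal place to match the leaderboard precision.
    return round(max(closed.values()) - max(open_.values()), 1)


gap = leader_gap(top_closed, top_open)
leader = "closed" if gap >= 0 else "open"
print(f"Gap on HLE: {gap:+.1f} pts, {leader} leads")
```

A gap defined on the two leaders alone says nothing about the rest of each group; comparing group means or medians would be a different, equally reasonable definition.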

Score vs Parameter Count

7 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

OpenAI	37.8	n = 3
DeepSeek	37.8	n = 3
Moonshot AI	36.3	n = 3
Anthropic	35.4	n = 2
Z.ai	28.8	n = 3

Most Correlated Benchmarks

AIME 2026	+0.76	n = 20
GPQA	+0.68	n = 27
Arena Score	+0.59	n = 23
SWE-Verified	+0.53	n = 23
Terminal Bench

What It Captures Well

Highest difficulty currently available, which gives meaningful headroom for the next two to three years.
Broad coverage across every academic field, complementing GPQA and MMLU-Pro.
Multimodal, so it stress-tests image understanding inside a reasoning task.

Where It Falls Short

Frontier scores are still low (under 30% for most models), so differences can be noisy.
Expert-written questions sometimes drift toward trivia rather than reasoning.
Closed leaderboard requires careful eval setup to reproduce.

Frequently Asked Questions

Why is HLE called the last exam?

The name reflects the authors' aim of building the final closed-ended academic benchmark of its kind. The questions are at or beyond the level of a top expert in each field, which should keep the benchmark useful even as models improve dramatically.

What is a good HLE score?

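Across the models tracked here, the mean score is 29.2 and the median is 28.7, so a score above roughly 30 is better than average. The top score is 52.1 (GPT-5.4), and the best open-weight model, Kimi K2.5, reaches 50.2. Even the best models remain well short of human-expert performance, which is the explicit design goal.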

Do I need a vision model to run HLE?

About 14% of items have images. Text-only models can still be evaluated on the remaining 86%, but the official score assumes full multimodal capability. Compare like with like when reading leaderboards.
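To see why the two numbers are not comparable, here is a small sketch that scores the same run both ways. The data is hypothetical: 100 questions, 14 of them with images, and a text-only model that answers 30 of the 86 text questions correctly and none of the image questions. The function names and the `(has_image, correct)` record layout are assumptions for illustration.

```python
def accuracy(flags: list[bool]) -> float:
    # Percentage of True values in a non-empty list.
    if not flags:
        raise ValueError("no items")
    return 100.0 * sum(flags) / len(flags)


def full_and_text_only(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    # results holds one (has_image, correct) pair per question.
    full = accuracy([correct for _, correct in results])
    text_only = accuracy(
        [correct for has_image, correct in results if not has_image]
    )
    return full, text_only


# Hypothetical run: image items all wrong, 30 of 86 text items right.
results = [(True, False)] * 14 + [(False, True)] * 30 + [(False, False)] * 56
full, text_only = full_and_text_only(results)
print(f"full set: {full:.1f}  text-only subset: {text_only:.1f}")
```

The same model looks several points stronger on the text-only subset than on the full set, so a leaderboard entry should always state which denominator it uses.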

Related Benchmarks

Based on score correlations across our database.

Pearson r +0.76

AIME 2026

n = 20
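A correlation like the one above can be reproduced with a short calculation: take the models that have a score on both benchmarks, then compute Pearson's r over those pairs, with n being the number of shared models. The sketch below assumes that procedure; the model names and scores are made up for illustration and are not taken from the database.

```python
import math


def pearson_over_shared(
    a: dict[str, float], b: dict[str, float]
) -> tuple[float, int]:
    # Use only models that appear in both score tables.
    shared = sorted(set(a) & set(b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")
    xs = [a[m] for m in shared]
    ys = [b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y), n


# Hypothetical scores; "model-d" and "model-e" are each missing from one table.
hle = {"model-a": 10.0, "model-b": 20.0, "model-c": 30.0, "model-d": 40.0}
other = {"model-a": 1.0, "model-b": 2.0, "model-c": 3.0, "model-e": 9.0}
r, n = pearson_over_shared(hle, other)
print(f"Pearson r {r:+.2f}, n = {n}")
```

Because n differs from one benchmark pair to the next, correlations computed on 20 models and on 27 models are not directly comparable, and small n makes any single r value fairly unstable.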
