Twenty-five hundred expert-written questions designed to be unsolvable by any current AI system, across every academic field.
HLE is the hardest broad-coverage benchmark in public use. The questions were crowdsourced from a thousand subject experts and explicitly filtered to defeat frontier models at the time of release. About 14% are multimodal, requiring image understanding. HLE measures how close a model is to the ceiling of human expert knowledge — and how much further the field still has to go.
Questions are short-answer or multiple choice. Scoring is exact-match for short-answer items and accuracy for multiple choice. Many questions include an image or diagram, so a fair score requires a multimodal model.
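For concreteness, the sketch below shows how that scoring scheme could be aggregated. The record layout (`answer`, `prediction` fields) and the normalization are assumptions made for this example; this is not the official evaluation harness.

```python
def normalize(text: str) -> str:
    # Lower-case and collapse whitespace so trivial formatting
    # differences are not scored as errors.
    return " ".join(str(text).lower().split())


def score_item(item: dict) -> bool:
    # Multiple-choice items reduce to matching the chosen option letter;
    # short-answer items are exact-match after normalization.
    return normalize(item["prediction"]) == normalize(item["answer"])


def overall_score(items: list[dict]) -> float:
    # Leaderboard-style score on a 0-100 scale: share of items answered correctly.
    return 100.0 * sum(score_item(it) for it in items) / len(items)


items = [
    {"type": "multiple_choice", "answer": "C", "prediction": "c"},
    {"type": "short_answer", "answer": "12.5 eV", "prediction": "12.5 ev"},
    {"type": "short_answer", "answer": "Riemann", "prediction": "Euler"},
]
print(f"{overall_score(items):.1f}")  # 66.7
```

The real harness may match free-form answers more carefully (for example, tolerant numeric comparison), so treat this only as a sketch of the aggregation.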
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | GPT-5.4 | OpenAI | Closed | 52.1 |
| 02 | Kimi K2.5 | Moonshot AI | Open | 50.2 |
| 03 | DeepSeek-V3.2 | DeepSeek | Open | 40.8 |
| 04 | Claude Opus 4.6 | Anthropic | Closed | 40.0 |
| 05 | DeepSeek-V4-Pro | DeepSeek | Open | 37.7 |
| 06 | Gemini 3 Pro | Google | Closed | 37.5 |
| 07 | GPT-5.2 | OpenAI | Closed | 35.4 |
| 08 | DeepSeek-V4-Flash | DeepSeek | Open | 34.8 |
| 09 | Kimi K2.6 | Moonshot AI | Open | 34.7 |
| 10 | Gemini 3 Flash | Google | Closed | 33.7 |
| 11 | GLM-5.1 | Z.ai | Open | 31.0 |
| 12 | Claude Sonnet 4.5 | Anthropic | Closed | 30.8 |
| 13 | GLM-5 | Z.ai | Open | 30.5 |
| 14 | Qwen3.5-397B-A17B | Alibaba | Open | 28.7 |
| 15 | GPT-5.1 | OpenAI | Closed | 26.0 |

Seven models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.

The authors built it to be the last benchmark of its kind that humanity needs to design: the questions are at or beyond the level of a top expert in each field, which keeps the benchmark useful even as models improve dramatically.

Most strong open-weight models score under 10%, and frontier closed models in 2026 land between 20% and 35%. Even the best models remain far from human-expert performance, which is the explicit design goal.

About 14% of items include an image. Text-only models can still be evaluated on the remaining 86%, but the official score assumes full multimodal capability, so compare like with like when reading leaderboards.
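A minimal sketch of that subset evaluation, reusing `score_item` from the earlier example: the `has_image` flag is an assumed field name, and the result is a text-only subset score, not comparable to official multimodal figures.

```python
def text_only_score(items: list[dict]) -> float:
    # Keep only items without an image (the ~86% text-only subset); the
    # `has_image` field name is an assumption for this sketch.
    subset = [it for it in items if not it.get("has_image", False)]
    # Same 0-100 accuracy as before, but computed over the reduced subset only.
    return 100.0 * sum(score_item(it) for it in subset) / len(subset)
```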