Twenty-five hundred expert-written questions designed to be unsolvable by any current AI system, across every academic field.
Humanity's Last Exam (HLE) is the hardest broad-coverage benchmark in public use. The questions were crowdsourced from a thousand subject experts and explicitly filtered to defeat the frontier models available at the time of release. About 14% are multimodal, requiring image understanding. HLE measures how close a model is to the ceiling of human expert knowledge, and how much further the field still has to go.
Questions are short-answer or multiple choice. Scoring is exact-match for short-answer items and accuracy for multiple choice. Because some questions include an image or diagram, a fair score requires a multimodal model.
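The scoring rule described above can be sketched in a few lines. This is a minimal illustration, not the official grader: the `Item` fields, the `kind` labels, and the normalization step (trim, lowercase, collapse whitespace) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    kind: str       # "short_answer" or "multiple_choice" (hypothetical labels)
    gold: str       # reference answer
    predicted: str  # the model's answer


def normalize(text: str) -> str:
    # Assumed normalization: trim, lowercase, collapse internal whitespace.
    return " ".join(text.strip().lower().split())


def is_correct(item: Item) -> bool:
    # Short-answer items use exact match on the normalized strings;
    # multiple-choice items compare the chosen option label the same way.
    return normalize(item.predicted) == normalize(item.gold)


def accuracy(items: list[Item]) -> float:
    # Percentage of items answered correctly.
    if not items:
        raise ValueError("no items to score")
    return 100.0 * sum(is_correct(i) for i in items) / len(items)


items = [
    Item("short_answer", "Paris", " paris "),   # matches after normalization
    Item("multiple_choice", "B", "C"),          # wrong option
    Item("multiple_choice", "D", "d"),          # matches after normalization
    Item("short_answer", "3.14", "3.1"),        # not an exact match
]
print(f"{accuracy(items):.1f}")
```

Exact match is strict by design: a near-miss such as `3.1` against `3.14` scores zero, which is one reason short-answer results are sensitive to how answers are normalized.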
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Gemini 3.1 Pro Preview | Google | Closed | 44.7 |
| 02 | GPT-5.5 | OpenAI | Closed | 44.3 |
| 03 | GPT-5.4 | OpenAI | Closed | 41.6 |
| 04 | GPT-5.4 High | OpenAI | Closed | 41.6 |
| 05 | Gemini 3.5 Flash | Google | Closed | 41.0 |
| 06 | Claude Opus 4.7 Thinking | Anthropic | Closed | 39.6 |
| 07 | Claude Opus 4.7 | Anthropic | Closed | 39.6 |
| 08 | Qwen 3.7 Max | Alibaba | Closed | 38.1 |
| 09 | Gemini 3 Pro | Google | Closed | 37.2 |
| 10 | Claude Opus 4.6 (Thinking) | Anthropic | Closed | 36.7 |
| 11 | DeepSeek-V4-Pro | DeepSeek | Open | 35.9 |
| 12 | Kimi K2.6 | Moonshot AI | Open | 35.9 |
| 13 | GPT-5.2 | OpenAI | Closed | 35.4 |
| 14 | GPT-5.2 High | OpenAI | Closed | 35.4 |
| 15 | Grok 4.3 | xAI | Closed | 35.0 |
76 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.
The authors built HLE to be the last benchmark of its kind, so difficult that humanity may run out of room to design harder ones. Its questions are at or beyond the level of a top expert in each field, which keeps the benchmark useful even as models improve dramatically.
The top 15 models in the table above score between 35.0% and 44.7%, and the two open-weight entries among them, DeepSeek-V4-Pro and Kimi K2.6, both reach 35.9%. Even the best models remain far from human-expert performance, as the benchmark was designed to ensure.
About 14% of items have images. Text-only models can still be evaluated on the remaining 86%, but the official score assumes full multimodal capability. Compare like with like when reading leaderboards.
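To see why the two kinds of score are not interchangeable, here is a small sketch that computes both from per-item results and converts a text-only score into its full-set equivalent. The function names and the `(has_image, correct)` tuple layout are hypothetical, and the 14% image share is the approximate figure quoted above.

```python
IMAGE_FRACTION = 0.14  # approximate share of items with images, per the text


def split_scores(results: list[tuple[bool, bool]]) -> tuple[float, float]:
    """Return (full-set score, text-only score) as percentages.

    Each result is a (has_image, correct) pair; this layout is an assumption.
    """
    text_only = [correct for has_image, correct in results if not has_image]
    if not results or not text_only:
        raise ValueError("need at least one text-only item")
    full = 100.0 * sum(correct for _, correct in results) / len(results)
    text = 100.0 * sum(text_only) / len(text_only)
    return full, text


def full_set_equivalent(text_only_score: float,
                        image_fraction: float = IMAGE_FRACTION) -> float:
    # If every image item is counted as wrong, the full-set score is the
    # text-only score scaled by the share of text-only items.
    return text_only_score * (1.0 - image_fraction)


# Four text-only items (two correct) and one image item the model cannot answer.
results = [(False, True), (False, False), (False, True), (False, False), (True, False)]
full, text = split_scores(results)
print(f"full={full:.1f} text_only={text:.1f}")
print(f"{full_set_equivalent(40.0):.1f}")
```

Under these assumptions, a text-only model that scores 40% on the text-only subset would land at roughly 34.4% on the full set, so a text-only score placed next to a full multimodal score overstates the model by several points.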