The 34 most-watched benchmarks for modern AI models across text, image, video, audio, and embedding. What each one measures, who scores highest, how scores spread, and which benchmarks correlate with which.
Expert-written science questions that PhD researchers can barely solve and Google searches cannot answer.
A harder and harder-to-game replacement for the original MMLU, testing reasoning across 14 academic and professional subjects.
Eighty-five hundred word problems that test whether a model can do multi-step arithmetic reasoning, not just recall.
Five hundred real GitHub issues, hand-checked by engineers, that test whether a model can ship a working code change.
Twenty-five hundred expert-written questions spanning every academic field, designed to be unsolvable by any current AI system.
Fifteen elite high-school competition math problems used as a yearly stress test for chain-of-thought reasoning.
A live agent test that drops a model into a real Linux shell and asks it to complete real engineering tasks.
Long-horizon, enterprise-style coding tasks that take human engineers hours, not minutes.
Sixteen thousand earnings-call Q&A pairs that test whether a model can spot when an executive is dodging the question.
Fourteen hundred real PDFs that test whether a model can turn messy documents into clean, structured markdown.
Elite university-level competition math problems, refreshed for 2026, used as a test of advanced reasoning.
Open head-to-head human preference rankings for chat models, the most-watched live leaderboard in AI.
Head-to-head human preference ranking for models that turn natural-language prompts into working web apps.
Head-to-head ranking for models that turn a screenshot or mockup into a working web app.
Head-to-head ranking for models that answer real questions using web search and citations.
Head-to-head ranking for vision-language models on real image-understanding prompts.
Head-to-head ranking for models that read PDFs, slides, and long screenshots to answer real questions.
Head-to-head human preference ranking for text-to-image and image-edit models, run by Arena.ai.
Head-to-head ranking for models that edit an input image given a text instruction.
Object-focused prompts that test whether a generator gets counts, positions, colors, and attributes right.
A reward model that predicts what humans will prefer, trained on hundreds of thousands of real preference labels.
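Reward models like this are typically trained with a pairwise preference objective: score the preferred and rejected outputs, then maximize the log-probability that the preferred one wins. A minimal sketch of that loss (the scoring model itself is assumed, not shown):

```python
# Minimal sketch of the pairwise (Bradley-Terry style) preference loss
# commonly used to train reward models on human preference labels.
# The reward model that produces the scores is assumed and not shown.
import torch

def preference_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(s_preferred - s_rejected): pushes preferred scores above rejected ones
    return -torch.nn.functional.logsigmoid(score_preferred - score_rejected).mean()
```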
A reward model that judges text-image alignment, fidelity, and aesthetic quality on a single combined score.
Head-to-head human preference ranking for text-to-video, image-to-video, and video-edit models.
Head-to-head ranking for models that animate a still input image, with or without a text instruction.
Head-to-head ranking for models that edit an input clip given a text instruction.
Sixteen-dimension benchmark covering temporal coherence, subject consistency, motion quality, and prompt fidelity.
Blind head-to-head listening test for text-to-speech models, ranked by Bradley-Terry on pairwise wins.
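For a sense of how pairwise wins become a leaderboard, here is a rough Bradley-Terry fit; the model names and win counts below are invented for illustration.

```python
# Minimal Bradley-Terry fit from pairwise win counts (illustrative only;
# the models and counts below are made up for the example).
import math

# wins[(a, b)] = number of times listeners preferred model a over model b
wins = {
    ("tts_a", "tts_b"): 62, ("tts_b", "tts_a"): 38,
    ("tts_a", "tts_c"): 71, ("tts_c", "tts_a"): 29,
    ("tts_b", "tts_c"): 55, ("tts_c", "tts_b"): 45,
}

models = sorted({m for pair in wins for m in pair})
strength = {m: 1.0 for m in models}  # Bradley-Terry skill parameters

# Fixed-point (MM) iteration toward the maximum-likelihood strengths.
for _ in range(200):
    new = {}
    for m in models:
        total_wins = sum(w for (a, b), w in wins.items() if a == m)
        denom = 0.0
        for o in models:
            if o == m:
                continue
            n_games = wins.get((m, o), 0) + wins.get((o, m), 0)
            if n_games:
                denom += n_games / (strength[m] + strength[o])
        new[m] = total_wins / denom if denom else strength[m]
    # normalize so strengths stay on a comparable scale between iterations
    s = sum(new.values())
    strength = {m: v * len(models) / s for m, v in new.items()}

# Report on an Elo-like scale, as arena leaderboards usually do.
for m in sorted(models, key=lambda m: -strength[m]):
    print(m, round(400 * math.log10(strength[m]) + 1000))
```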
The standard accuracy metric for speech-to-text: lower is better.
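Concretely, word error rate is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal implementation:

```python
# Minimal word error rate (WER): word-level edit distance between the
# reference transcript and the hypothesis, divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```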
Classic 1–5 listener rating for TTS naturalness. The longest-running quality metric in speech.
The single number most teams quote when comparing embedding models. Aggregates 56 datasets across 8 task types.
The retrieval slice of MTEB. The most important sub-score if you are building RAG.
Tests whether the embedding captures enough semantic structure for downstream classifiers to work.
Tests whether semantically similar items end up close together in the embedding space.
Measures whether the embedding distance between two sentences matches human similarity judgments.
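A minimal sketch of how an STS-style score is computed: cosine similarity between the two sentence embeddings, compared against the human ratings with Spearman correlation (the statistic MTEB reports for STS). Here `embed` is a stand-in for whichever embedding model you are evaluating.

```python
# Minimal STS-style scoring: cosine similarity of sentence embeddings,
# correlated with human similarity judgments via Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

def embed(sentences: list[str]) -> np.ndarray:
    # Stand-in for the embedding model under evaluation.
    raise NotImplementedError("plug in your embedding model here")

def sts_score(pairs: list[tuple[str, str]], human_scores: list[float]) -> float:
    a = embed([p[0] for p in pairs])
    b = embed([p[1] for p in pairs])
    # cosine similarity per sentence pair
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return spearmanr(cos, human_scores).correlation
```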
We help product and engineering teams turn benchmark scores into shipped features. Free first conversation.