The 34 most-watched benchmarks for modern AI models across text, image, video, audio, and embedding. What each one measures, who scores highest, how scores spread, and which benchmarks correlate with which.
Expert-written science questions that PhD researchers can barely solve and Google searches cannot answer.
A harder and harder-to-game replacement for the original MMLU, testing reasoning across 14 academic and professional subjects.
Eighty-five hundred word problems that test whether a model can do multi-step arithmetic reasoning, not just recall.
Five hundred real GitHub issues, hand-checked by engineers, that test whether a model can ship a working code change.
Twenty-five hundred expert-written questions spanning every academic field, designed to be unsolvable by any current AI system.
Fifteen elite high-school competition math problems used as a yearly stress test for chain-of-thought reasoning.
A live agent test that drops a model into a real Linux shell and asks it to complete real engineering tasks.
Long-horizon, enterprise-style coding tasks that take human engineers hours, not minutes.
Sixteen thousand earnings-call Q&A pairs that test whether a model can spot when an executive is dodging the question.
Fourteen hundred real PDFs that test whether a model can turn messy documents into clean, structured markdown.
Elite university-level competition math problems, refreshed for 2026, used as a test of advanced reasoning.
Open head-to-head human preference rankings for chat models, the most-watched live leaderboard in AI.
Head-to-head human preference ranking for models that turn natural-language prompts into working web apps.
Head-to-head ranking for models that turn a screenshot or mockup into a working web app.
Head-to-head ranking for models that answer real questions using web search and citations.
Head-to-head ranking for vision-language models on real image-understanding prompts.
Head-to-head ranking for models that read PDFs, slides, and long screenshots to answer real questions.
Head-to-head human preference ranking for text-to-image and image-edit models, run by Arena.ai.
Head-to-head ranking for models that edit an input image given a text instruction.
Object-focused prompts that test whether a generator gets counts, positions, colors, and attributes right.
A reward model that predicts what humans will prefer, trained on hundreds of thousands of real preference labels.
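Reward models like this are typically trained with a pairwise preference objective: score the preferred and rejected outputs, then maximize the log-probability that the preferred one wins. A minimal sketch of that loss (the scoring model itself is assumed, not shown):

```python
# Minimal sketch of the pairwise (Bradley-Terry style) preference loss
# commonly used to train reward models on human preference labels.
# The reward model that produces the scores is assumed and not shown.
import torch

def preference_loss(score_preferred: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(s_preferred - s_rejected): pushes preferred scores above rejected ones
    return -torch.nn.functional.logsigmoid(score_preferred - score_rejected).mean()
```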
A reward model that judges text-image alignment, fidelity, and aesthetic quality on a single combined score.
Head-to-head human preference ranking for text-to-video, image-to-video, and video-edit models.
Head-to-head ranking for models that animate a still input image, with or without a text instruction.
Head-to-head ranking for models that edit an input clip given a text instruction.
Sixteen-dimension benchmark covering temporal coherence, subject consistency, motion quality, and prompt fidelity.
Blind head-to-head listening test for text-to-speech models, ranked by Bradley-Terry on pairwise wins.
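For a sense of how pairwise wins become a leaderboard, here is a rough Bradley-Terry fit; the model names and win counts below are invented for illustration.

```python
# Minimal Bradley-Terry fit from pairwise win counts (illustrative only;
# the models and counts below are made up for the example).
import math

# wins[(a, b)] = number of times listeners preferred model a over model b
wins = {
    ("tts_a", "tts_b"): 62, ("tts_b", "tts_a"): 38,
    ("tts_a", "tts_c"): 71, ("tts_c", "tts_a"): 29,
    ("tts_b", "tts_c"): 55, ("tts_c", "tts_b"): 45,
}

models = sorted({m for pair in wins for m in pair})
strength = {m: 1.0 for m in models}  # Bradley-Terry skill parameters

# Fixed-point (MM) iteration toward the maximum-likelihood strengths.
for _ in range(200):
    new = {}
    for m in models:
        total_wins = sum(w for (a, b), w in wins.items() if a == m)
        denom = 0.0
        for o in models:
            if o == m:
                continue
            n_games = wins.get((m, o), 0) + wins.get((o, m), 0)
            if n_games:
                denom += n_games / (strength[m] + strength[o])
        new[m] = total_wins / denom if denom else strength[m]
    # normalize so strengths stay on a comparable scale between iterations
    s = sum(new.values())
    strength = {m: v * len(models) / s for m, v in new.items()}

# Report on an Elo-like scale, as arena leaderboards usually do.
for m in sorted(models, key=lambda m: -strength[m]):
    print(m, round(400 * math.log10(strength[m]) + 1000))
```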
The standard accuracy metric for speech-to-text: lower is better.
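Concretely, word error rate is the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal implementation:

```python
# Minimal word error rate (WER): word-level edit distance between the
# reference transcript and the hypothesis, divided by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words ≈ 0.167
```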
Classic 1–5 listener rating for TTS naturalness. The longest-running quality metric in speech.
The single number most teams quote when comparing embedding models. Aggregates 56 datasets across 8 task types.
The retrieval slice of MTEB. The most important sub-score if you are building RAG.
Tests whether the embedding captures enough semantic structure for downstream classifiers to work.
Tests whether semantically similar items end up close together in the embedding space.
Measures whether the embedding distance between two sentences matches human similarity judgments.
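A minimal sketch of how an STS-style score is computed: cosine similarity between the two sentence embeddings, compared against the human ratings with Spearman correlation (the statistic MTEB reports for STS). Here `embed` is a stand-in for whichever embedding model you are evaluating.

```python
# Minimal STS-style scoring: cosine similarity of sentence embeddings,
# correlated with human similarity judgments via Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

def embed(sentences: list[str]) -> np.ndarray:
    # Stand-in for the embedding model under evaluation.
    raise NotImplementedError("plug in your embedding model here")

def sts_score(pairs: list[tuple[str, str]], human_scores: list[float]) -> float:
    a = embed([p[0] for p in pairs])
    b = embed([p[1] for p in pairs])
    # cosine similarity per sentence pair
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return spearmanr(cos, human_scores).correlation
```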
We help product and engineering teams turn benchmark scores into shipped features. Free first conversation.