Head-to-head ranking for models that answer real questions using web search and citations.
Search Arena ranks models on the "ChatGPT search / Perplexity / You.com" task: take a real question, search the web, and produce a grounded answer with citations. Votes reward a combination of correctness, citation quality, freshness, and writing quality. It is the closest public signal for how well a model handles the RAG-style workflows that most production AI assistants now ship.
Each comparison is anonymous. Two search-augmented models receive the same user question, search the live web, and produce a cited answer. Voters pick which answer they trust more. Wins, losses, and ties feed a Bradley-Terry rating, normalized here to 0–100.
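To make the rating step concrete, here is a minimal Bradley-Terry fit over toy votes. It is an illustrative sketch, not the Arena's actual pipeline: the tie handling (half a win each way), the MM-style update, the iteration count, and the min-max rescale to 0-100 are all assumptions of this sketch.

```python
import math
from collections import defaultdict

def bradley_terry(votes, iters=200):
    """votes: list of (model_a, model_b, result), result in {"a", "b", "tie"}.
    Returns {model: strength} via the classic MM (minorize-maximize) update.
    Ties are split as half a win for each side (one common simplification)."""
    wins = defaultdict(float)            # wins[(i, j)] = times i beat j
    models = set()
    for a, b, r in votes:
        models.update((a, b))
        if r == "a":
            wins[(a, b)] += 1.0
        elif r == "b":
            wins[(b, a)] += 1.0
        else:                            # tie: half a win each way
            wins[(a, b)] += 0.5
            wins[(b, a)] += 0.5

    p = {m: 1.0 for m in models}         # strength parameters
    for _ in range(iters):
        new_p = {}
        for i in models:
            num = sum(wins[(i, j)] for j in models if j != i)
            den = sum((wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])
                      for j in models if j != i)
            new_p[i] = num / den if den else p[i]
        # renormalize so strengths average to 1 (fixes the scale)
        scale = sum(new_p.values()) / len(new_p)
        p = {m: v / scale for m, v in new_p.items()}
    return p

def to_0_100(strengths):
    """Min-max rescale of log-strengths onto 0-100 (one plausible
    normalization; the page does not say which one it actually uses)."""
    logs = {m: math.log(s) for m, s in strengths.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = (hi - lo) or 1.0
    return {m: 100.0 * (v - lo) / span for m, v in logs.items()}

votes = [("model-x", "model-y", "a"), ("model-x", "model-z", "tie"),
         ("model-y", "model-z", "b"), ("model-y", "model-x", "a")]
print(to_0_100(bradley_terry(votes)))
```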
MTEB Retrieval scores how well an embedding model ranks documents for a query. Search Arena scores the entire pipeline a user actually interacts with: searching, reading the results, and writing the answer. Embedding retrieval is only one input to that full answer.
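As a rough picture of why the two scores measure different things, here is a skeleton of a search-augmented answering pipeline. Every function name and interface below is a stub invented for illustration, not Search Arena's actual harness or any product's API; the point is only that the passage-ranking stage MTEB Retrieval isolates is one link in a longer chain, while Arena votes judge the chain's final output.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    url: str
    snippet: str

def web_search(query: str) -> list[str]:
    """Stage 1: fetch candidate URLs from a live search API (stubbed)."""
    return [f"https://example.com/result/{i}" for i in range(5)]

def rank_passages(query: str, urls: list[str]) -> list[Citation]:
    """Stage 2: read pages and rank passages for the query (stubbed).
    This is the slice of the pipeline MTEB Retrieval evaluates in isolation."""
    return [Citation(url=u, snippet=f"passage from {u}") for u in urls[:3]]

def write_answer(query: str, sources: list[Citation]) -> str:
    """Stage 3: draft a grounded answer with inline citations (stubbed).
    Search Arena votes judge the end-to-end output of all three stages."""
    refs = " ".join(f"[{i + 1}]({c.url})" for i, c in enumerate(sources))
    return f"Answer to {query!r}, grounded in: {refs}"

def answer(query: str) -> str:
    return write_answer(query, rank_passages(query, web_search(query)))

print(answer("Who won the most recent Nobel Prize in Physics?"))
```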
For consumer-facing search and Q&A products, it is a strong primary signal; pair it with the chat Arena Score. For internal RAG over your own data, weight MTEB Retrieval and your own offline evaluation more heavily.