Head-to-head ranking for models that answer real questions using web search and citations.
Search Arena ranks models on the "ChatGPT search / Perplexity / You.com" task: take a real question, search the web, and produce a grounded answer with citations. Votes reward a combination of correctness, citation quality, freshness, and writing quality. It is the closest public signal for how well a model handles the RAG-style workflows that most production AI assistants now ship.
Each comparison is anonymous. Two search-augmented models receive the same user question, search the live web, and produce a cited answer. Voters pick which answer they trust more. Wins, losses, and ties feed a Bradley-Terry rating, normalized here to 0–100.
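To make the rating step concrete, here is a minimal Bradley-Terry fit over toy votes. It is an illustrative sketch, not the Arena's actual pipeline: the tie handling (half a win each way), the MM-style update, the iteration count, and the min-max rescale to 0-100 are all assumptions of this sketch.

```python
import math
from collections import defaultdict

def bradley_terry(votes, iters=200):
    """votes: list of (model_a, model_b, result), result in {"a", "b", "tie"}.
    Returns {model: strength} via the classic MM (minorize-maximize) update.
    Ties are split as half a win for each side (one common simplification)."""
    wins = defaultdict(float)            # wins[(i, j)] = times i beat j
    models = set()
    for a, b, r in votes:
        models.update((a, b))
        if r == "a":
            wins[(a, b)] += 1.0
        elif r == "b":
            wins[(b, a)] += 1.0
        else:                            # tie: half a win each way
            wins[(a, b)] += 0.5
            wins[(b, a)] += 0.5

    p = {m: 1.0 for m in models}         # strength parameters
    for _ in range(iters):
        new_p = {}
        for i in models:
            num = sum(wins[(i, j)] for j in models if j != i)
            den = sum((wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])
                      for j in models if j != i)
            new_p[i] = num / den if den else p[i]
        # renormalize so strengths average to 1 (fixes the scale)
        scale = sum(new_p.values()) / len(new_p)
        p = {m: v / scale for m, v in new_p.items()}
    return p

def to_0_100(strengths):
    """Min-max rescale of log-strengths onto 0-100 (one plausible
    normalization; the page does not say which one it actually uses)."""
    logs = {m: math.log(s) for m, s in strengths.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = (hi - lo) or 1.0
    return {m: 100.0 * (v - lo) / span for m, v in logs.items()}

votes = [("model-x", "model-y", "a"), ("model-x", "model-z", "tie"),
         ("model-y", "model-z", "b"), ("model-y", "model-x", "a")]
print(to_0_100(bradley_terry(votes)))
```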
MTEB Retrieval scores how well an embedding model ranks documents for a query. Search Arena scores the entire pipeline a user actually interacts with: searching, reading the results, and writing the answer. Embedding retrieval is only one input to that full answer.
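As a rough picture of why the two scores measure different things, here is a skeleton of a search-augmented answering pipeline. Every function name and interface below is a stub invented for illustration, not Search Arena's actual harness or any product's API; the point is only that the passage-ranking stage MTEB Retrieval isolates is one link in a longer chain, while Arena votes judge the chain's final output.

```python
from dataclasses import dataclass

@dataclass
class Citation:
    url: str
    snippet: str

def web_search(query: str) -> list[str]:
    """Stage 1: fetch candidate URLs from a live search API (stubbed)."""
    return [f"https://example.com/result/{i}" for i in range(5)]

def rank_passages(query: str, urls: list[str]) -> list[Citation]:
    """Stage 2: read pages and rank passages for the query (stubbed).
    This is the slice of the pipeline MTEB Retrieval evaluates in isolation."""
    return [Citation(url=u, snippet=f"passage from {u}") for u in urls[:3]]

def write_answer(query: str, sources: list[Citation]) -> str:
    """Stage 3: draft a grounded answer with inline citations (stubbed).
    Search Arena votes judge the end-to-end output of all three stages."""
    refs = " ".join(f"[{i + 1}]({c.url})" for i, c in enumerate(sources))
    return f"Answer to {query!r}, grounded in: {refs}"

def answer(query: str) -> str:
    return write_answer(query, rank_passages(query, web_search(query)))

print(answer("Who won the most recent Nobel Prize in Physics?"))
```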
For consumer-facing search and Q&A products, it is a strong primary signal; pair it with the chat Arena Score. For internal RAG over your own data, weight MTEB Retrieval and your own offline evaluation more heavily.