Open head-to-head human preference rankings for chat models, the most-watched live leaderboard in AI.
Arena.ai (formerly LMSYS Chatbot Arena) shows two anonymous model outputs side by side for a real user prompt and asks the user to pick the better one. The pairwise votes are aggregated with a Bradley-Terry model into an Elo-style score that ranks models by how often humans prefer them. Unlike fixed-question benchmarks, the prompts come from real users, so the score reflects everyday usefulness rather than test-taking ability.
Every comparison is anonymous: the user does not see which model produced which response. Pairwise wins, losses, and ties feed a Bradley-Terry model that yields a single rating per model. We normalize the published rating to a 0–100 scale on this page so it can be compared against the other text benchmarks at a glance. Arena.ai now runs sister leaderboards across modalities and specialized tasks — Image Arena (text-to-image, image-edit), Video Arena (text-to-video, image-to-video, video-edit), plus dedicated boards for code (WebDev, Image-to-WebDev), search, vision, and document tasks — all using the same Bradley-Terry methodology.
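To make the aggregation concrete, here is a minimal Python sketch of the pipeline described above: collect pairwise votes, fit Bradley-Terry strengths, convert them to an Elo-style rating, then rescale to 0-100. The vote format, the tie handling (half a win for each side), and the min-max rescaling are illustrative assumptions, not Arena.ai's published implementation.

```python
from collections import defaultdict
import math

def fit_bradley_terry(votes, n_iters=200):
    """votes: iterable of (model_a, model_b, outcome), outcome in {"a", "b", "tie"}.
    Returns a dict of Bradley-Terry strengths fitted with the classic MM updates."""
    wins = defaultdict(float)    # effective wins per model (a tie counts as 0.5)
    games = defaultdict(float)   # number of comparisons per unordered pair
    models = set()
    for a, b, outcome in votes:
        models.update((a, b))
        games[frozenset((a, b))] += 1.0
        if outcome == "a":
            wins[a] += 1.0
        elif outcome == "b":
            wins[b] += 1.0
        else:                    # tie: split the win (a simplifying assumption)
            wins[a] += 0.5
            wins[b] += 0.5

    strength = {m: 1.0 for m in models}
    for _ in range(n_iters):
        new = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    other = next(x for x in pair if x != m)
                    denom += n / (strength[m] + strength[other])
            new[m] = wins[m] / denom if denom > 0 else strength[m]
        mean = sum(new.values()) / len(new)          # fix the arbitrary scale
        strength = {m: s / mean for m, s in new.items()}
    return strength

def to_leaderboard_scores(strength):
    """Map strengths to an Elo-style rating (400 * log10, anchored at 1000),
    then min-max scale to 0-100 for cross-benchmark display."""
    elo = {m: 1000 + 400 * math.log10(max(s, 1e-12)) for m, s in strength.items()}
    lo, hi = min(elo.values()), max(elo.values())
    return {m: 100 * (r - lo) / (hi - lo) for m, r in elo.items()}

# Placeholder votes, not real Arena data:
votes = [("model-x", "model-y", "a"), ("model-y", "model-z", "a"),
         ("model-z", "model-x", "a"), ("model-x", "model-y", "a")]
print(to_leaderboard_scores(fit_bradley_terry(votes)))
```

Min-max scaling against the current top and bottom models is only one plausible reading of the 0-100 normalization mentioned above; the page does not spell out its exact rescaling.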
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Claude Opus 4.6 (Thinking) | Anthropic | Closed | 100.0 |
| 02 | Claude Opus 4.6 | Anthropic | Closed | 99.4 |
| 03 | Gemini 3.1 Pro Preview | Google | Closed | 98.1 |
| 04 | Claude Opus 4.7 Thinking | Anthropic | Closed | 98.0 |
| 05 | Gemini 3 Pro | Google | Closed | 96.9 |
| 06 | Claude Opus 4.7 | Anthropic | Closed | 96.6 |
| 07 | Meta Muse Spark | Meta | Closed | 96.5 |
| 08 | Qwen3.5 Max Preview | Alibaba | Closed | 95.3 |
| 09 | GPT-5.4 High | OpenAI | Closed | 95.3 |
| 10 | GLM-5.1 | Z.ai | Open | 95.1 |
| 11 | Gemini 3 Flash | Google | Closed | 95.0 |
| 12 | GPT-5.5 | OpenAI | Closed | 94.2 |
| 13 | Gemini 2.5 Pro | Google | Closed | 94.0 |
| 14 | Grok 4.20 Beta 0309 Reasoning | xAI | Closed | 93.2 |
| 15 | Kimi K2.6 | Moonshot AI | Open | 93.2 |
97 models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.
It is both an open dataset and a leaderboard. The underlying data is millions of pairwise votes, released openly. The headline output is the Elo-style ranking, which is what most people mean when they say "Arena score".
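Because the ranking is Elo-style, the gap between two published ratings translates directly into a predicted head-to-head preference rate. A small sketch, assuming the conventional 400-point logistic Elo curve; the ratings in the example are placeholders, not published Arena figures.

```python
def win_probability(rating_a: float, rating_b: float) -> float:
    """Predicted chance that users prefer model A over model B, given
    Elo-style ratings on the conventional 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 30-point rating gap implies roughly a 54% preference rate (placeholder values).
print(win_probability(1430.0, 1400.0))  # ~0.543
```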
Academic benchmarks reward correctness on fixed questions. Arena rewards what users prefer, which mixes correctness with style, helpfulness, and tone. A model that is technically right but cold and verbose can lose to a warmer model on Arena while winning on GPQA.
Treat Arena as the "general consumer feel" score. Pair it with a task-specific benchmark — SWE-Verified for coding, GPQA for science reasoning, EvasionBench for finance — to avoid choosing a model that feels good but underperforms on your actual workload.
LM Arena, Chatbot Arena, and Arena.ai are the same project: LMSYS Org rebranded Chatbot Arena as Arena.ai, and older papers and articles still call it "LM Arena" or "Chatbot Arena".