Open head-to-head human preference rankings for chat models, the most-watched live leaderboard in AI.
Arena.ai (formerly LMSYS Chatbot Arena) shows two anonymous model outputs side by side for a real user prompt and asks the user to pick the better one. The pairwise votes are aggregated with a Bradley-Terry model into an Elo-style score that ranks models by how often humans prefer them. Unlike fixed-question benchmarks, the prompts come from real users, so the score reflects everyday usefulness rather than test-taking ability.
Every comparison is anonymous: the user does not see which model produced which response. Pairwise wins, losses, and ties feed a Bradley-Terry model that yields a single rating per model. We normalize the published rating to a 0–100 scale on this page so it can be compared against the other text benchmarks at a glance. Arena.ai now runs sister leaderboards across modalities and specialized tasks — Image Arena (text-to-image, image-edit), Video Arena (text-to-video, image-to-video, video-edit), plus dedicated boards for code (WebDev, Image-to-WebDev), search, vision, and document tasks — all using the same Bradley-Terry methodology.
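To make the aggregation concrete, here is a minimal Python sketch of the pipeline described above: collect pairwise votes, fit Bradley-Terry strengths, convert them to an Elo-style rating, then rescale to 0-100. The vote format, the tie handling (half a win for each side), and the min-max rescaling are illustrative assumptions, not Arena.ai's published implementation.

```python
from collections import defaultdict
import math

def fit_bradley_terry(votes, n_iters=200):
    """votes: iterable of (model_a, model_b, outcome), outcome in {"a", "b", "tie"}.
    Returns a dict of Bradley-Terry strengths fitted with the classic MM updates."""
    wins = defaultdict(float)    # effective wins per model (a tie counts as 0.5)
    games = defaultdict(float)   # number of comparisons per unordered pair
    models = set()
    for a, b, outcome in votes:
        models.update((a, b))
        games[frozenset((a, b))] += 1.0
        if outcome == "a":
            wins[a] += 1.0
        elif outcome == "b":
            wins[b] += 1.0
        else:                    # tie: split the win (a simplifying assumption)
            wins[a] += 0.5
            wins[b] += 0.5

    strength = {m: 1.0 for m in models}
    for _ in range(n_iters):
        new = {}
        for m in models:
            denom = 0.0
            for pair, n in games.items():
                if m in pair:
                    other = next(x for x in pair if x != m)
                    denom += n / (strength[m] + strength[other])
            new[m] = wins[m] / denom if denom > 0 else strength[m]
        mean = sum(new.values()) / len(new)          # fix the arbitrary scale
        strength = {m: s / mean for m, s in new.items()}
    return strength

def to_leaderboard_scores(strength):
    """Map strengths to an Elo-style rating (400 * log10, anchored at 1000),
    then min-max scale to 0-100 for cross-benchmark display."""
    elo = {m: 1000 + 400 * math.log10(max(s, 1e-12)) for m, s in strength.items()}
    lo, hi = min(elo.values()), max(elo.values())
    return {m: 100 * (r - lo) / (hi - lo) for m, r in elo.items()}

# Placeholder votes, not real Arena data:
votes = [("model-x", "model-y", "a"), ("model-y", "model-z", "a"),
         ("model-z", "model-x", "a"), ("model-x", "model-y", "a")]
print(to_leaderboard_scores(fit_bradley_terry(votes)))
```

Min-max scaling against the current top and bottom models is only one plausible reading of the 0-100 normalization mentioned above; the page does not spell out its exact rescaling.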
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Claude Opus 4.6 (Thinking) | Anthropic | Closed | 100.0 |
| 02 | Claude Opus 4.6 | Anthropic | Closed | 99.4 |
| 03 | Gemini 3.1 Pro Preview | Google | Closed | 98.1 |
| 04 | Claude Opus 4.7 Thinking | Anthropic | Closed | 98.0 |
| 05 | Gemini 3 Pro | Google | Closed | 96.9 |
| 06 | Claude Opus 4.7 | Anthropic | Closed | 96.6 |
| 07 | Meta Muse Spark | Meta | Closed | 96.5 |
| 08 | Qwen3.5 Max Preview | Alibaba | Closed | 95.3 |
| 09 | GPT-5.4 High | OpenAI | Closed | 95.3 |
| 10 | GLM-5.1 | Z.ai | Open | 95.1 |
| 11 | Gemini 3 Flash | Google | Closed | 95.0 |
| 12 | GPT-5.5 | OpenAI | Closed | 94.2 |
| 13 | Gemini 2.5 Pro | Google | Closed | 94.0 |
| 14 | Grok 4.20 Beta 0309 Reasoning | xAI | Closed | 93.2 |
| 15 | Kimi K2.6 | Moonshot AI | Open | 93.2 |
97 models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.
It is both an open dataset and a leaderboard. The underlying data is millions of pairwise votes, released openly. The headline output is the Elo-style ranking, which is what most people mean when they say "Arena score".
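Because the ranking is Elo-style, the gap between two published ratings translates directly into a predicted head-to-head preference rate. A small sketch, assuming the conventional 400-point logistic Elo curve; the ratings in the example are placeholders, not published Arena figures.

```python
def win_probability(rating_a: float, rating_b: float) -> float:
    """Predicted chance that users prefer model A over model B, given
    Elo-style ratings on the conventional 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# A 30-point rating gap implies roughly a 54% preference rate (placeholder values).
print(win_probability(1430.0, 1400.0))  # ~0.543
```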
Academic benchmarks reward correctness on fixed questions. Arena rewards what users prefer, which mixes correctness with style, helpfulness, and tone. A model that is technically right but cold and verbose can lose to a warmer model on Arena while winning on GPQA.
Treat Arena as the "general consumer feel" score. Pair it with a task-specific benchmark — SWE-Verified for coding, GPQA for science reasoning, EvasionBench for finance — to avoid choosing a model that feels good but underperforms on your actual workload.
LM Arena, Chatbot Arena, and Arena.ai are the same project: LMSYS Org rebranded Chatbot Arena as Arena.ai, and older papers and articles still call it "LM Arena" or "Chatbot Arena".