Benchmarks · 2024

Image Arena: Arena.ai Image Leaderboard

Name: Image Arena: Arena.ai Image Leaderboard
Creator: Arena.ai (formerly LMSYS)
Published: 2024
Keywords: Image Arena, AI benchmark, image model evaluation, Arena.ai (formerly LMSYS)

Head-to-head human preference ranking for text-to-image and image-edit models, run by Arena.ai.

Open Dataset

Scores are min-max normalized. Arena.ai publishes raw Bradley-Terry / Elo ratings; we rescale them to a 0–100 axis across every scored model so they sit next to accuracy-style benchmarks. Rankings stay the same as on arena.ai.
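The rescaling described above can be sketched as follows. This is a minimal illustration that assumes the standard min-max formula (lowest rating maps to 0, highest to 100). The function name, model names, and rating values are made up for the example and are not Arena.ai data.

```python
def min_max_normalize(ratings: dict[str, float]) -> dict[str, float]:
    """Rescale raw ratings to a 0-100 axis using the standard min-max formula."""
    lo, hi = min(ratings.values()), max(ratings.values())
    if hi == lo:
        # All ratings equal: the formula is undefined, so this sketch maps everything to 100.
        return {name: 100.0 for name in ratings}
    return {name: 100.0 * (r - lo) / (hi - lo) for name, r in ratings.items()}


# Hypothetical raw ratings, for illustration only.
raw = {"model_a": 1250.0, "model_b": 1100.0, "model_c": 1000.0}
scores = min_max_normalize(raw)
```

Because the mapping is an increasing linear transform, the order of the models does not change, which is consistent with the rankings matching arena.ai.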

Models Tested

Top Score

100.0

Published

2024

Source

Arena.ai (formerly LMSYS)

How It Works

Image Arena is the image-generation companion to the Arena.ai chat leaderboard. A user types a prompt, sees two anonymous images, and picks the one they prefer. A Bradley-Terry model fitted to the pairwise outcomes produces an Elo-style rating that reflects real-world taste rather than performance on a narrow test set. Arena.ai now runs two separate image boards: text-to-image at arena.ai/leaderboard/text-to-image and image-edit at arena.ai/leaderboard/image-edit. We report the text-to-image rating here.

Voters do not see which model produced which image. Wins, losses, and ties on every pairwise comparison feed into a single rating per model. We normalize the published rating to a 0–100 scale on this page for consistency with the other modalities. The text-to-image board scores prompt-to-image generation; the image-edit board scores conditional edits where the model gets an input image plus an instruction.
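To make the rating step concrete, here is a generic Bradley-Terry fit using the classic MM (Zermelo) iteration. It is a sketch under stated assumptions, not Arena.ai's pipeline: the vote counts are hypothetical, counting a tie as half a win for each side is one common convention rather than a documented choice, and the 1000-point base of the Elo-style scale is arbitrary.

```python
import math


def fit_bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths with the MM (Zermelo) iteration.

    wins[i][j] is the number of comparisons model i won against model j.
    A tie can be entered as 0.5 for each side (an assumption of this sketch).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Strengths are only defined up to a constant factor, so fix their sum to n.
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]
    return p


def to_elo_scale(strengths, base=1000.0):
    """Map strengths to an Elo-style scale: 400 points per factor of 10 in odds.

    Requires every strength to be positive (each model has at least one win).
    """
    return [base + 400.0 * math.log10(s) for s in strengths]


# Hypothetical vote counts for three anonymous models.
wins = [
    [0, 60, 80],
    [40, 0, 70],
    [20, 30, 0],
]
strengths = fit_bradley_terry(wins)
ratings = to_elo_scale(strengths)
```

Under this model the probability that model i beats model j is strengths[i] / (strengths[i] + strengths[j]), so a single number per model summarizes every pairwise result.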

Dataset size

Hundreds of thousands of anonymous side-by-side image comparisons over real prompts, split into a text-to-image and an image-edit board.

Mean score

50.4

Median score

50.3

Open / Closed

12 / 11

Top Scorers

#	Model	Lab	Source	Score
01	GPT Image 2	OpenAI	Closed	100.0
02	MAI-Image-2.5	Microsoft AI	Closed	85.3
03	GPT Image 1.5	OpenAI	Closed	82.8
04	Gemini 3 Pro Image (Nano Banana Pro)	Google	Closed	72.5
05	MAI-Image-2.5-Flash	Microsoft AI	Closed	71.2
06	FLUX.2 [pro]	Black Forest Labs	Closed	65.2
07	Gemini 2.5 Flash Image (Nano Banana)	Google	Closed	58.6
08	FLUX.2 [dev]	Black Forest Labs	Open	58.6
09	Qwen-Image-2512	Alibaba	Open	58.1
10	gpt-image-1	OpenAI	Closed	52.9
11	Hunyuan-Image 3.0	Tencent	Open	50.6
12	FLUX.2 [klein] 9B	Black Forest Labs	Open	50.3
13	Hunyuan-Image 3.0 Instruct	Tencent	Open	48.5
14	Z-Image-Turbo	Alibaba	Open	46.7
15	Imagen 4	Google	Closed	45.3

Score Distribution

Open vs Closed Source

Gap on Image Arena: +41.4 pts (closed leads)

Top Open-Source Models

1	FLUX.2 [dev]	58.6
2	Qwen-Image-2512	58.1
3	Hunyuan-Image 3.0	50.6

Top Closed-Source Models

1	GPT Image 2	100.0
2	MAI-Image-2.5	85.3
3	GPT Image 1.5	82.8
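The +41.4 figure matches the difference between the best closed-source score and the best open-source score (100.0 and 58.6). The page does not state how the gap is defined, so the sketch below assumes that definition. The rows are copied from the Top Scorers table, and the helper name is illustrative.

```python
# (model, source, score) rows copied from the Top Scorers table.
rows = [
    ("GPT Image 2", "Closed", 100.0),
    ("MAI-Image-2.5", "Closed", 85.3),
    ("GPT Image 1.5", "Closed", 82.8),
    ("FLUX.2 [dev]", "Open", 58.6),
    ("Qwen-Image-2512", "Open", 58.1),
    ("Hunyuan-Image 3.0", "Open", 50.6),
]


def best_score(rows, source):
    """Highest score among models with the given source label."""
    return max(score for _, src, score in rows if src == source)


# Assumed definition: top closed-source score minus top open-source score.
gap = round(best_score(rows, "Closed") - best_score(rows, "Open"), 1)
```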

Score vs Parameter Count

11 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average score	n
Microsoft AI	78.3	2
OpenAI	61.6	4
Google	58.8	3
Tencent	49.5	2
Black Forest Labs	48.0	5
Alibaba	47.0	3
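The per-lab figures appear to be plain means over each lab's scored models. As a rough check, the sketch below recomputes the average for the two labs whose models all appear in the Top Scorers table (Microsoft AI with n = 2 and Google with n = 3). Other labs include models outside the top 15, so their averages cannot be reproduced from this page alone.

```python
from collections import defaultdict
from statistics import mean

# (lab, score) pairs taken from the Top Scorers table.
rows = [
    ("Microsoft AI", 85.3),
    ("Microsoft AI", 71.2),
    ("Google", 72.5),
    ("Google", 58.6),
    ("Google", 45.3),
]

# Group scores by lab, then take the unweighted mean of each group.
by_lab = defaultdict(list)
for lab, score in rows:
    by_lab[lab].append(score)

lab_average = {lab: mean(scores) for lab, scores in by_lab.items()}
```

The Microsoft AI mean comes out at 78.25, which the page shows as 78.3, and the Google mean at 58.8.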

Most Correlated Benchmarks

Image Edit Arena: +0.88 (n = 14)
Pearson r: −1 to +1. Positive means the two benchmarks rank models in similar order; negative means the opposite.
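For reference, a Pearson correlation between two benchmarks is computed over the models that have a score on both. The sketch below uses the textbook formula. The score lists are hypothetical and do not correspond to any models on this page.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four models that appear on both benchmarks.
benchmark_a = [90.0, 70.0, 50.0, 30.0]
benchmark_b = [85.0, 75.0, 40.0, 35.0]
r = pearson_r(benchmark_a, benchmark_b)
```

Pearson r measures linear association between the scores themselves. If only the ordering matters, a rank correlation such as Spearman's would be the closer fit.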

What It Captures Well

Reflects real prompts and real human aesthetic preference, not curated test sets.
Updated continuously as new generators ship.
Hard to overfit because the prompts come from live users.

Where It Falls Short

Preference is subjective and culturally skewed toward English-speaking voters.
Slow to differentiate niche models that get few votes.
Does not measure prompt fidelity directly: see GenEval for that.

Frequently Asked Questions

How is Image Arena different from GenEval?

Image Arena measures what users prefer; GenEval measures whether the image faithfully follows the prompt. A model with strong prompt-following can still lose a vote to a less faithful model if voters find the second model's image more aesthetically pleasing.

Is Image Arena reliable for non-English prompts?

Less so. The voter base is mostly English-speaking, and stylistic preferences vary by culture. For non-English use cases, weight GenEval and HPS v2 more heavily.

Related Benchmarks

Based on score correlations across our database.

Image Edit Arena: Pearson r +0.88
GenEval
HPS v2
ImageReward
