WebDev Arena: Arena.ai WebDev Leaderboard

Name: WebDev Arena: Arena.ai WebDev Leaderboard
Creator: Arena.ai
Published: 2024
Keywords: WebDev Arena, AI benchmark, text model evaluation, Arena.ai

Head-to-head human preference ranking for models that turn natural-language prompts into working web apps.

Open Dataset

Scores are min-max normalized. Arena.ai publishes raw Bradley-Terry / Elo ratings; we rescale them to a 0–100 axis across every scored model so they can sit next to accuracy-style benchmarks. The rescaling is monotonic, so rankings stay the same as on arena.ai.
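As a rough illustration of the rescaling described above, here is a minimal sketch of min-max normalization to a 0–100 axis. The function name, the model names, and the rating values are hypothetical and are not real leaderboard data; how ties and empty inputs are handled is also an assumption of this sketch.

```python
def min_max_normalize(ratings: dict[str, float]) -> dict[str, float]:
    """Rescale raw ratings to a 0-100 axis; the ordering of models is preserved."""
    if not ratings:
        return {}
    lowest = min(ratings.values())
    highest = max(ratings.values())
    if highest == lowest:
        # Every model has the same rating, so the scale is undefined.
        # Returning 0.0 for all of them is an assumption of this sketch.
        return {name: 0.0 for name in ratings}
    span = highest - lowest
    return {name: 100.0 * (r - lowest) / span for name, r in ratings.items()}


# Hypothetical raw ratings, for illustration only.
raw = {"model_a": 1400.0, "model_b": 1250.0, "model_c": 1100.0}
scores = min_max_normalize(raw)
print(scores)
```

Because the mapping is strictly increasing whenever at least two ratings differ, sorting by the normalized score gives the same order as sorting by the raw rating.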

How It Works

WebDev Arena is the Arena.ai board for "build me an app" prompts. A user describes a small web tool, two anonymous models each generate code, and the user picks the app they prefer. The ranking captures end-to-end ability: code that compiles, a layout that makes sense, working interactions, and a UI that does not look generic. It is the closest live signal for how well a model handles the prompt-to-product loop that people actually run with AI coding tools.

Each side-by-side comparison is anonymous: the user never knows which model produced which app. Pairwise wins and losses feed a Bradley-Terry model that yields a single rating per model. We normalize the published rating to 0–100 so it can sit next to the other text benchmarks. Generated apps are run in a sandboxed preview, so an app that fails to build effectively cannot win.
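To make the rating step more concrete, the sketch below fits a Bradley-Terry model to a list of pairwise outcomes using the standard minorization-maximization (MM) update. It illustrates the general technique and is not Arena.ai's actual pipeline; the function name, the iteration count, and the sample votes are all hypothetical.

```python
from collections import defaultdict


def fit_bradley_terry(matches, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each model to a strength; strengths sum to 1.
    A model with zero wins ends up at strength 0, a known limitation of
    the plain maximum-likelihood fit.
    """
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in matches:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                if other == m:
                    continue
                n = pair_counts.get(frozenset((m, other)), 0.0)
                if n:
                    denom += n / (strength[m] + strength[other])
            # MM update: wins divided by the expected-comparison weight.
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Hypothetical votes: each tuple is (winner, loser) of one A/B comparison.
votes = (
    [("a", "b")] * 3 + [("b", "a")]
    + [("b", "c")] * 3 + [("c", "b")]
    + [("a", "c")] * 3 + [("c", "a")]
)
strength = fit_bradley_terry(votes)
print(sorted(strength, key=strength.get, reverse=True))
```

Under this model, the estimated probability that model i beats model j is strength[i] / (strength[i] + strength[j]), which is what lets many separate pairwise votes collapse into a single rating per model.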

Dataset size

Tens of thousands of anonymous A/B comparisons of generated web apps across real prompts.

Top Scorers

No scores yet for this benchmark.

Score Distribution

Not enough scored models yet.

Most Correlated Benchmarks

Not enough scored models yet.

What It Captures Well

Captures the full coding-as-product loop, not isolated code-snippet correctness.
Real user prompts: closer to consumer and indie-hacker workflows than SWE-Verified.
Updated continuously, so new releases get a rating within days.

Where It Falls Short

Preference is subjective: visual taste sways votes independently of code quality.
Single-shot generation; does not measure iterative refinement.
Heavily weighted toward small front-end apps; backend and infra ability is under-represented.

Frequently Asked Questions

How is WebDev Arena different from SWE-Verified?

SWE-Verified asks: can the model fix a real GitHub issue against a hidden test suite? WebDev Arena asks: can the model build something from scratch that users actually want to use? Both matter, but they measure different points on the coding skill curve.

Does WebDev Arena reward design quality?

Yes. Visual layout, polish, and interaction patterns all affect votes. Apps that are correct but ugly tend to lose to well-designed apps, even when the well-designed app has worse code under the hood.

Is this a good benchmark for picking a coding model?

For greenfield product work and prototyping, yes. For maintenance work on existing codebases, pair it with SWE-Verified and SWE-Pro; for shell agents and CLI work, pair it with Terminal Bench.

Related Benchmarks

Based on score correlations across our database.

GPQA: Pearson r not available (n = 0)
