Head-to-head human preference ranking for models that turn natural-language prompts into working web apps.
WebDev Arena is the Arena.ai board for "build me an app" prompts. A user describes a small web tool, two anonymous models each generate code, and the user picks which app they like better. The ranking captures end-to-end ability: code that compiles, a layout that makes sense, working interactions, and a UI that does not look generic. It is the closest live signal for how well a model handles the prompt-to-product loop that people actually run with AI coding tools.
Each side-by-side comparison is anonymous: the user never knows which model produced which app. Pairwise wins and losses feed a Bradley-Terry model that yields a single rating per model. We normalize the published rating to 0–100 so it can sit next to the other text benchmarks. Generated apps are run in a sandboxed preview, so an app that fails to build effectively cannot win.
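The rating step described above can be sketched in a few lines of Python. This is a minimal illustration, not Arena.ai's or our actual pipeline: the function names `fit_bradley_terry` and `normalize_0_100` are hypothetical, the fit uses the standard iterative (minorization-maximization) update for Bradley-Terry strengths, and the 0–100 step is assumed here to be a min-max rescale of log-strengths, since the exact normalization is not specified.

```python
import math
from collections import defaultdict


def fit_bradley_terry(battles, iterations=200):
    """Fit Bradley-Terry strengths from a list of (winner, loser) pairs.

    Assumes every model has at least one win; a model with zero wins
    would be driven to strength 0 by this simple update.
    """
    models = sorted({m for pair in battles for m in pair})
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # number of battles per unordered pair
    for winner, loser in battles:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1

    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            # MM update: p_m = W_m / sum_j n_mj / (p_m + p_j)
            denom = sum(
                games[frozenset((m, other))] / (strength[m] + strength[other])
                for other in models
                if other != m
            )
            updated[m] = wins[m] / denom if denom > 0 else strength[m]
        # Rescale so the strengths keep a fixed average of 1.0.
        total = sum(updated.values())
        strength = {m: s * len(models) / total for m, s in updated.items()}
    return strength


def normalize_0_100(strength):
    """Min-max rescale log-strengths to 0-100 (an assumed normalization)."""
    log_s = {m: math.log(s) for m, s in strength.items()}
    lo, hi = min(log_s.values()), max(log_s.values())
    if hi == lo:
        return {m: 50.0 for m in log_s}
    return {m: 100.0 * (v - lo) / (hi - lo) for m, v in log_s.items()}


# Made-up votes: each tuple is (winning model, losing model).
battles = (
    [("model_a", "model_b")] * 3 + [("model_b", "model_a")]
    + [("model_b", "model_c")] * 3 + [("model_c", "model_b")]
    + [("model_a", "model_c")] * 3 + [("model_c", "model_a")]
)
scores = normalize_0_100(fit_bradley_terry(battles))
for model, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {score:.1f}")
```

With a min-max rescale, the top and bottom models always land at 100 and 0, so the normalized number only expresses position relative to the other models currently on the board.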
No scores yet for this benchmark.
SWE-Verified asks: can the model fix a real GitHub issue against a hidden test suite? WebDev Arena asks: can the model build something from scratch that users actually want to use? Both matter, but they measure different points on the coding skill curve.
Visual layout, polish, and interaction patterns all affect votes. Models that produce ugly-but-correct apps tend to lose to models that produce well-designed apps, even when the well-designed app has worse code under the hood.
For greenfield product work and prototyping, WebDev Arena is a useful guide. For maintenance work on existing codebases, pair it with SWE-Verified and SWE-Pro; for shell agents and CLI work, pair it with Terminal Bench.