Head-to-head human preference ranking for models that turn natural-language prompts into working web apps.
WebDev Arena is the Arena.ai board for "build me an app" prompts. A user describes a small web tool, two anonymous models each generate code, and the user picks which app they like better. The ranking captures end-to-end ability: code that compiles, a layout that makes sense, working interactions, and a UI that does not look generic. It is the closest live signal for how well a model handles the prompt-to-product workflow people actually use AI coding tools for.
Each side-by-side comparison is anonymous: the user never knows which model produced which app. Pairwise wins and losses feed a Bradley-Terry model that yields a single rating per model. We normalize the published rating to 0–100 so it can sit next to the other text benchmarks. Generated apps are run in a sandboxed preview, so an app that fails to build is effectively unable to win a vote.
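To make the rating step concrete, here is a minimal sketch of a Bradley-Terry fit from pairwise vote counts, followed by a min-max rescale to 0–100. The model names, vote counts, and the rescaling rule are illustrative assumptions, not the Arena's actual pipeline, which uses its own anchoring and confidence intervals.

```python
import numpy as np

# Hypothetical head-to-head data: wins[i][j] = votes model i won against model j.
models = ["model-a", "model-b", "model-c"]
wins = np.array([[0, 30, 45],
                 [20, 0, 28],
                 [15, 22, 0]], dtype=float)

def bradley_terry(wins, iters=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths p_i with the standard MM update:
    p_i <- W_i / sum_j(n_ij / (p_i + p_j)), renormalized each iteration."""
    n_ij = wins + wins.T          # total comparisons between each pair
    w_i = wins.sum(axis=1)        # total wins per model
    p = np.ones(wins.shape[0])
    for _ in range(iters):
        denom = n_ij / (p[:, None] + p[None, :])
        p_new = w_i / denom.sum(axis=1)
        p_new /= p_new.sum()
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p

p = bradley_terry(wins)
rating = np.log(p)                # log-strengths, an Elo-like scale
score = 100 * (rating - rating.min()) / (rating.max() - rating.min())
for name, s in zip(models, score):
    print(f"{name}: {s:.1f}")
```

The min-max rescale above pins the weakest listed model at 0 and the strongest at 100; a leaderboard that normalizes against fixed anchor models would produce different absolute numbers, but the ordering from the Bradley-Terry fit is the same.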
No scores yet for this benchmark.
Not enough scored models yet.
SWE-Verified asks: can the model fix a real GitHub issue against a hidden test suite? WebDev Arena asks: can the model build something from scratch that users actually want to use? Both matter, but they measure different points on the coding skill curve.
Yes — visual layout, polish, and interaction patterns all affect votes. Models that produce ugly-but-correct apps tend to lose to models that produce well-designed apps even when the second model has worse code under the hood.
For greenfield product work and prototyping, yes. For maintenance work on existing codebases, pair it with SWE-Verified and SWE-Pro; for shell agents and CLI work, pair it with Terminal Bench.