Head-to-head ranking for models that turn a screenshot or mockup into a working web app.
Image-to-WebDev tests one of the most-requested AI-coding workflows: paste a screenshot of a UI, get a working clone. The model receives an input image plus an optional natural-language hint, then produces a runnable web app. Voters compare two anonymous reproductions of the same source image and pick the one that looks and behaves closer to the original. The benchmark stresses three things at once: image understanding, code generation, and visual taste.
The model is given the reference image and produces code; the generated app is rendered in a sandbox and shown side-by-side with another model's attempt, with both model identities hidden from voters. Fitting a Bradley-Terry model to the pairwise wins yields an Elo-style rating, which we normalize to 0–100.
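The page doesn't specify how the sandbox is built. One common way to isolate untrusted generated UIs in a browser is a sandboxed iframe; the sketch below assumes each submission is a single self-contained HTML file, and `renderSubmission` and its parameters are illustrative names, not the benchmark's actual harness.

```typescript
// Mount untrusted generated HTML in an isolated iframe for side-by-side voting.
function renderSubmission(html: string, mount: HTMLElement): void {
  const frame = document.createElement("iframe");
  // "allow-scripts" without "allow-same-origin": the generated app can run
  // its own JavaScript but gets an opaque origin, so it cannot read the
  // parent page, its cookies, or local storage.
  frame.sandbox.add("allow-scripts");
  frame.srcdoc = html;
  frame.style.width = "100%";
  frame.style.height = "600px";
  mount.appendChild(frame);
}
```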
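To make the rating step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from a pairwise win matrix with the standard MM update (Hunter 2004), then min-max scaling the log-strengths to 0–100. The leaderboard's exact normalization isn't stated, so the scaling here is an assumption, and `bradleyTerryScores` is an illustrative name.

```typescript
// Fit Bradley-Terry strengths from pairwise wins via the MM algorithm,
// then map the log-strengths onto a 0-100 scale.
// Assumes n >= 2 models and that every model has at least one win and one
// loss; otherwise a strength collapses to 0 and the scaling divides by zero.
function bradleyTerryScores(wins: number[][], iters = 500): number[] {
  const n = wins.length;
  const totalWins = wins.map(row => row.reduce((a, b) => a + b, 0)); // W_i
  let p: number[] = new Array(n).fill(1);
  for (let t = 0; t < iters; t++) {
    const next = p.slice();
    for (let i = 0; i < n; i++) {
      // MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
      let denom = 0;
      for (let j = 0; j < n; j++) {
        if (j === i) continue;
        const games = wins[i][j] + wins[j][i]; // n_ij, comparisons of i vs j
        if (games > 0) denom += games / (p[i] + p[j]);
      }
      if (denom > 0) next[i] = totalWins[i] / denom;
    }
    const sum = next.reduce((a, b) => a + b, 0);
    p = next.map(x => x / sum); // strengths are scale-free; pin the scale
  }
  const ratings = p.map(Math.log); // Elo-style: ratings live on a log scale
  const lo = Math.min(...ratings);
  const hi = Math.max(...ratings);
  return ratings.map(r => (100 * (r - lo)) / (hi - lo));
}

// Hypothetical three-model example: wins[i][j] = times model i beat model j.
const wins = [
  [0, 7, 9],
  [3, 0, 6],
  [1, 4, 0],
];
console.log(bradleyTerryScores(wins).map(s => s.toFixed(1)));
```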
WebDev Arena starts from a text prompt. Image-to-WebDev starts from a reference image. The skills overlap, but vision-capable models with strong layout reasoning have a much bigger edge on Image-to-WebDev.
Less directly. But it is a strong predictor of how well a model can interpret design feedback delivered as images (Figma frames, whiteboard photos, hand-drawn sketches), which is the same skill in a different package.
Based on score correlations across our database.