Head-to-head ranking for models that animate a still input image, with or without a text instruction.
Image-to-Video Arena scores motion conditioned on a still input image. The user provides a photo, optionally adds a description of how it should move, and two anonymous models each generate a clip. Voters pick the better animation. The benchmark rewards subject preservation, plausible motion, and faithful interpretation of the prompt: the skills that matter for "make this photo move" product features.
Each comparison is anonymous. Both models receive the same input image and, when one is provided, the same motion instruction, then each produces a short clip. A Bradley-Terry model fitted to the pairwise votes yields a single rating per model, normalized to 0–100.
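To make the rating step concrete, here is a minimal sketch of how pairwise votes could be turned into Bradley-Terry strengths and then into a 0–100 score. It is an illustration, not the arena's actual pipeline: the function names, the model names, and the vote counts are hypothetical, and the min-max scaling of log-strengths is only one assumed way to reach a 0–100 range, since the exact normalization is not specified here.

```python
import math


def bradley_terry(wins, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the classic MM (Zermelo) update.

    wins maps (winner, loser) -> number of votes. Every model needs at
    least one win for its strength to stay above zero.
    """
    models = sorted({m for pair in wins for m in pair})
    total_wins = {m: 0.0 for m in models}
    games = {}  # games[(a, b)] = number of comparisons between a and b
    for (winner, loser), n in wins.items():
        total_wins[winner] += n
        games[(winner, loser)] = games.get((winner, loser), 0.0) + n
        games[(loser, winner)] = games.get((loser, winner), 0.0) + n

    # Start from equal strengths and iterate until the estimates settle.
    p = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(
                games.get((i, j), 0.0) / (p[i] + p[j])
                for j in models if j != i
            )
            new_p[i] = total_wins[i] / denom if denom > 0 else p[i]
        scale = sum(new_p.values())  # renormalize so strengths sum to 1
        new_p = {m: v / scale for m, v in new_p.items()}
        delta = max(abs(new_p[m] - p[m]) for m in models)
        p = new_p
        if delta < tol:
            break
    return p


def normalize_0_100(strengths):
    """Assumed normalization: min-max scaling of log-strengths to 0-100."""
    logs = {m: math.log(max(v, 1e-12)) for m, v in strengths.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:
        return {m: 50.0 for m in logs}
    return {m: 100.0 * (v - lo) / (hi - lo) for m, v in logs.items()}


# Hypothetical vote counts: (winner, loser) -> votes.
votes = {
    ("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
    ("model_b", "model_c"): 6, ("model_c", "model_b"): 4,
    ("model_a", "model_c"): 8, ("model_c", "model_a"): 2,
}
ratings = normalize_0_100(bradley_terry(votes))
for model, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {rating:.1f}")
```

Under this assumed scaling the strongest model always maps to 100 and the weakest to 0, so the ratings are relative to the current pool of models rather than absolute.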
No scores yet for this benchmark.
Not enough scored models yet.
Use this benchmark whenever you have a reference image you want to animate. Image-to-video is more controllable and tends to produce more consistent identities, but it is harder to swap subjects mid-clip.
Based on score correlations across our database.