Head-to-head ranking for models that edit an input clip given a text instruction.
Video Edit Arena scores models on instruction-based clip editing: a user supplies a video plus a text instruction ("make it daytime", "change the car color", "remove the person on the right") and the model returns the edited clip. Voters compare two anonymous edits. The board rewards localized edits, temporal coherence, and instruction adherence — much harder than image edits because changes must be consistent across every frame.
Each comparison is blind: both models receive the same input clip and the same edit instruction, then produce an edited clip. A Bradley-Terry model fit to the pairwise wins yields a single rating per model, normalized to 0–100.
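The rating step above can be sketched as follows. This is a minimal illustration of fitting Bradley-Terry strengths from (winner, loser) vote pairs with the standard MM iteration, then min-max scaling to 0–100; the function name, iteration count, and normalization choice are assumptions for the sketch, not the arena's actual pipeline.

```python
from collections import defaultdict

def bradley_terry_scores(pairs, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) pairs via the
    classic MM iteration, then min-max normalize to a 0-100 scale.
    Illustrative sketch only; assumes every model has at least one win
    and the comparison graph is connected."""
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # comparisons per unordered model pair
    models = set()
    for winner, loser in pairs:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))
    p = {m: 1.0 for m in models}  # initial strengths
    for _ in range(iters):
        new_p = {}
        for i in models:
            # MM update: strength_i = wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(games[frozenset((i, j))] / (p[i] + p[j])
                        for j in models
                        if j != i and games[frozenset((i, j))])
            new_p[i] = wins[i] / denom if denom else p[i]
        total = sum(new_p.values())
        p = {m: v / total for m, v in new_p.items()}  # keep strengths normalized
    lo, hi = min(p.values()), max(p.values())
    return {m: 100.0 * (v - lo) / (hi - lo) if hi > lo else 50.0
            for m, v in p.items()}
```

Under the Bradley-Terry model, the probability that model i beats model j is p_i / (p_i + p_j), so the fitted strengths directly predict head-to-head win rates, not just an ordering.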
No scores yet for this benchmark.
Not enough scored models yet.
Same setup, harder problem. Video edits must stay coherent across every frame, so a change that looks right in one frame can flicker or drift in the next. Models that do well on image edits often stumble on video.
Based on score correlations across our database.