Head-to-head ranking for models that edit an input image given a text instruction.
Image Edit Arena scores models on a different image task than text-to-image generation: a user provides an image plus a short instruction ("add a hat", "make it night", "remove the person on the left") and the model returns the edit. Voters compare two anonymous edits and pick the better one. The benchmark rewards faithful localization (changing only what was asked), preservation of the rest of the image, and instruction-following accuracy — skills that pure text-to-image models often lack.
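Concretely, each vote can be thought of as one record pairing the shared inputs with the two anonymous outputs. A minimal sketch in Python; the `EditComparison` name and fields are illustrative assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EditComparison:
    """One anonymous head-to-head edit comparison (hypothetical schema)."""
    source_image: str   # path or URL of the shared input image
    instruction: str    # shared edit instruction, e.g. "add a hat"
    output_a: str       # edited image from anonymous model A
    output_b: str       # edited image from anonymous model B
    winner: str         # voter's pick: "a" or "b"
```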
Each comparison is anonymous: both models receive the same input image and the same edit instruction, then produce an edited image. Fitting a Bradley-Terry model to the pairwise wins yields a single strength per model, which we normalize to a 0–100 score.
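The rating step is standard enough to sketch. Below is a minimal Python illustration using the classic minorization-maximization updates for Bradley-Terry; the `votes` list of (model_a, model_b, winner) tuples is hypothetical, and min-max scaling of log-strengths to 0–100 is one plausible normalization, not necessarily the exact one used here.

```python
import math
from collections import defaultdict

# Hypothetical vote log: one (model_a, model_b, winner) tuple per comparison.
votes = [
    ("edit-x", "edit-y", "edit-x"),
    ("edit-x", "edit-z", "edit-z"),
    ("edit-y", "edit-z", "edit-y"),
    ("edit-x", "edit-y", "edit-x"),
]

def bradley_terry(votes, iters=500):
    """Fit Bradley-Terry strengths p with the MM updates:
    p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j).
    Assumes every model has at least one win; otherwise its
    maximum-likelihood strength is zero and the log below fails."""
    wins = defaultdict(int)  # wins[(i, j)]: times i beat j
    models = set()
    for a, b, winner in votes:
        models.update((a, b))
        loser = b if winner == a else a
        wins[(winner, loser)] += 1

    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            w_i = sum(wins[(i, j)] for j in models)  # total wins of i
            denom = sum(
                (wins[(i, j)] + wins[(j, i)]) / (p[i] + p[j])
                for j in models
                if j != i and wins[(i, j)] + wins[(j, i)] > 0
            )
            new_p[i] = w_i / denom if denom else p[i]
        # Strengths are identified only up to a constant factor,
        # so fix the scale at geometric mean 1 each iteration.
        g = math.exp(sum(math.log(v) for v in new_p.values()) / len(new_p))
        p = {m: v / g for m, v in new_p.items()}
    return p

def to_scores(p):
    """Min-max scale log-strengths to 0-100 (one plausible display choice)."""
    logs = {m: math.log(v) for m, v in p.items()}
    lo, hi = min(logs.values()), max(logs.values())
    span = (hi - lo) or 1.0  # guard against a degenerate single-model case
    return {m: round(100 * (v - lo) / span, 1) for m, v in logs.items()}

print(to_scores(bradley_terry(votes)))
```

The MM update is a fixed-point iteration on the Bradley-Terry likelihood; with a strongly connected win graph (every model both wins and loses somewhere) it converges to the unique maximum-likelihood strengths.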
No scores yet for this benchmark.
Image Arena (text-to-image) starts from a blank canvas — a prompt becomes an image. Image Edit Arena starts from an existing image — a prompt plus the source becomes an edited image. Strong text-to-image models often score poorly on edits and vice versa.
Use this benchmark for any product that lets users edit photos with natural language: photo retouching, marketing-asset variations, conditional generation, or inpainting. For from-scratch generation, prioritize Image Arena and GenEval.