Object-focused prompts that test whether a generator gets counts, positions, colors, and attributes right.
GenEval is a sharp test of prompt fidelity: every prompt is engineered to require getting a specific compositional fact right, such as "two cats next to a red apple" or "a blue car to the left of a yellow truck". An image passes only if it contains the requested objects in the requested counts, colors, and positions.
For each generated image, an off-the-shelf object detector scores whether the objects, counts, positions, colors, and attributes match the prompt. The composite score is the percentage of prompts for which every criterion is satisfied.
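To make the pass/fail logic concrete, here is a minimal Python sketch of the kind of per-prompt check described above and the composite aggregation. It is an illustration, not the actual GenEval harness: the `Detection`, `Criterion`, `prompt_correct`, and `composite_score` names are hypothetical, and it assumes an upstream detector has already produced labeled boxes with an estimated color per instance.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Detection:
    """One detected object instance in a generated image."""
    label: str                      # object class, e.g. "cat"
    color: str                      # dominant color of the instance, e.g. "red"
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

@dataclass
class Criterion:
    """One requirement extracted from the prompt."""
    label: str                   # required object class
    count: int = 1               # required number of instances
    color: Optional[str] = None  # required color, if the prompt specifies one

def criterion_satisfied(dets: List[Detection], crit: Criterion) -> bool:
    """True if the image contains exactly the requested instances."""
    matches = [d for d in dets
               if d.label == crit.label
               and (crit.color is None or d.color == crit.color)]
    return len(matches) == crit.count

def left_of(a: Detection, b: Detection) -> bool:
    """Crude positional check: box a lies entirely to the left of box b."""
    return a.box[2] <= b.box[0]

def prompt_correct(dets: List[Detection], criteria: List[Criterion]) -> bool:
    """A prompt counts as correct only if every criterion is satisfied."""
    return all(criterion_satisfied(dets, c) for c in criteria)

def composite_score(per_prompt_results: List[bool]) -> float:
    """Composite score: percent of prompts where all checks passed."""
    return 100.0 * sum(per_prompt_results) / len(per_prompt_results)

# Example: "two cats next to a red apple"
dets = [Detection("cat", "gray", (10, 40, 80, 120)),
        Detection("cat", "black", (90, 35, 150, 118)),
        Detection("apple", "red", (160, 60, 200, 100))]
criteria = [Criterion("cat", count=2), Criterion("apple", color="red")]
print(prompt_correct(dets, criteria))              # True
print(composite_score([True, True, False, True]))  # 75.0
```

Nothing in this sketch looks at realism or aesthetics; the check is purely about whether the requested composition is present.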
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | BAGEL-7B-MoT | ByteDance | Open | 88.0 |
| 02 | Z-Image-Turbo | Alibaba | Open | 83.0 |
| 03 | Qwen-Image | Alibaba | Open | 81.4 |
Most current text-to-image models score between 40% and 75%. Top diffusion and autoregressive models in 2026 push above 80% on the overall split. Anything under 35% is unlikely to handle multi-object compositional prompts reliably.
No. It only asks whether the requested objects are present with the correct properties. A cartoonish image that nails the composition can score higher than a photorealistic one that flubs an attribute.