Object-focused prompts that test whether a generator gets counts, positions, colors, and attributes right.
GenEval is a sharp test of prompt fidelity: every prompt is engineered to require getting a specific compositional fact right, such as "two cats next to a red apple" or "a blue car to the left of a yellow truck". An image passes only if it contains the requested objects in the requested counts, colors, and positions.
For each generated image, an off-the-shelf object detector scores whether the objects, counts, positions, colors, and attributes match the prompt. The composite score is the percentage of prompts for which every criterion is satisfied.
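To make the pass/fail logic concrete, here is a minimal Python sketch of the kind of per-prompt check described above and the composite aggregation. It is an illustration, not the actual GenEval harness: the `Detection`, `Criterion`, `prompt_correct`, and `composite_score` names are hypothetical, and it assumes an upstream detector has already produced labeled boxes with an estimated color per instance.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Detection:
    """One detected object instance in a generated image."""
    label: str                      # object class, e.g. "cat"
    color: str                      # dominant color of the instance, e.g. "red"
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels

@dataclass
class Criterion:
    """One requirement extracted from the prompt."""
    label: str                   # required object class
    count: int = 1               # required number of instances
    color: Optional[str] = None  # required color, if the prompt specifies one

def criterion_satisfied(dets: List[Detection], crit: Criterion) -> bool:
    """True if the image contains exactly the requested instances."""
    matches = [d for d in dets
               if d.label == crit.label
               and (crit.color is None or d.color == crit.color)]
    return len(matches) == crit.count

def left_of(a: Detection, b: Detection) -> bool:
    """Crude positional check: box a lies entirely to the left of box b."""
    return a.box[2] <= b.box[0]

def prompt_correct(dets: List[Detection], criteria: List[Criterion]) -> bool:
    """A prompt counts as correct only if every criterion is satisfied."""
    return all(criterion_satisfied(dets, c) for c in criteria)

def composite_score(per_prompt_results: List[bool]) -> float:
    """Composite score: percent of prompts where all checks passed."""
    return 100.0 * sum(per_prompt_results) / len(per_prompt_results)

# Example: "two cats next to a red apple"
dets = [Detection("cat", "gray", (10, 40, 80, 120)),
        Detection("cat", "black", (90, 35, 150, 118)),
        Detection("apple", "red", (160, 60, 200, 100))]
criteria = [Criterion("cat", count=2), Criterion("apple", color="red")]
print(prompt_correct(dets, criteria))              # True
print(composite_score([True, True, False, True]))  # 75.0
```

Nothing in this sketch looks at realism or aesthetics; the check is purely about whether the requested composition is present.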
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | BAGEL-7B-MoT | ByteDance | Open | 88.0 |
| 02 | Z-Image-Turbo | Alibaba | Open | 83.0 |
| 03 | Qwen-Image | Alibaba | Open | 81.4 |
Most current text-to-image models score between 40% and 75%. Top diffusion and autoregressive models in 2026 push above 80% on the overall split. Anything under 35% is unlikely to handle multi-object compositional prompts reliably.
No. It only asks whether the requested objects are present with the correct properties. A cartoonish image that nails the composition can score higher than a photorealistic one that flubs an attribute.