Sixteen-dimension benchmark covering temporal coherence, subject consistency, motion quality, and prompt fidelity.
VBench breaks video generation into the dimensions that humans actually notice: subject consistency over time, background consistency, temporal flickering, motion smoothness, dynamic degree, aesthetic quality, imaging quality, object class, multiple objects, human action, color, spatial relationship, scene, appearance style, temporal style, and overall consistency. Each dimension has its own scoring pipeline, and the overall score is a weighted average across all sixteen.
For each dimension, VBench uses a tailored scoring method — object detectors for class fidelity, motion estimators for smoothness, classifiers for style, and so on. Models are run on a fixed prompt set and scored per dimension. The headline number is a weighted aggregate; per-dimension scores are the more actionable read.
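The aggregation step can be sketched as a weighted average over per-dimension scores. The weights and scores below are illustrative placeholders, not the official VBench values:

```python
# Sketch of VBench-style aggregation: each dimension gets its own score
# from a tailored pipeline, then a weighted average produces the headline
# number. Dimension names, weights, and scores here are hypothetical.

DIM_WEIGHTS = {
    "subject_consistency": 1.0,
    "motion_smoothness": 1.0,
    "dynamic_degree": 0.5,   # assumed down-weighting, for illustration
    "imaging_quality": 1.0,
}

def overall_score(per_dim: dict[str, float]) -> float:
    """Weighted average over the dimensions present in per_dim (scores in [0, 1])."""
    total_weight = sum(DIM_WEIGHTS[d] for d in per_dim)
    return sum(DIM_WEIGHTS[d] * s for d, s in per_dim.items()) / total_weight

# Example: a model strong on consistency and motion, weaker on imaging.
scores = {
    "subject_consistency": 0.96,
    "motion_smoothness": 0.98,
    "dynamic_degree": 0.70,
    "imaging_quality": 0.68,
}
print(round(overall_score(scores), 3))  # → 0.849
```

This is why the per-dimension scores are the more actionable read: two models with the same aggregate can have very different per-dimension profiles.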
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Wan2.2-T2V-A14B | Alibaba | Open | 86.2 |
| 02 | Mochi 1 Preview | Genmo AI | Open | 77.4 |
Top closed-source models in 2026 score 82–86% on the overall index; strong open-weight models land between 75% and 80%. Below 70%, failure modes become visible even in casual viewing.