Head-to-head ranking for vision-language models on real image-understanding prompts.
Vision Arena ranks how well vision-language models answer questions about images. A user uploads a photo, screenshot, diagram, or chart and asks something about it; two anonymous models answer; the user picks the better response. The board captures everything from basic OCR to complex chart reading and visual reasoning. It is the cleanest live signal for "can this model see and explain what it sees".
Each comparison is anonymous. Both models receive the same image and the same question, then produce text answers. Voters choose the better answer. Wins and losses feed a Bradley-Terry model, and the resulting ratings are normalized here to 0–100.
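For the curious, here is a minimal sketch of how pairwise votes can become a 0–100 score. The toy vote data, the Zermelo/MM fitting loop, and the min-max normalization are all illustrative assumptions, not the arena's actual pipeline.

```python
import math
from collections import defaultdict

# Hypothetical vote log: one (winner, loser) pair per head-to-head vote.
# (Toy data where every model wins at least once; real pipelines
# regularize zero-win cases to avoid degenerate strengths.)
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
]

models = sorted({m for pair in votes for m in pair})
wins = defaultdict(int)   # total wins per model
pairs = defaultdict(int)  # comparison count per unordered pair
for winner, loser in votes:
    wins[winner] += 1
    pairs[frozenset((winner, loser))] += 1

# Fit Bradley-Terry strengths p_i, where P(i beats j) = p_i / (p_i + p_j),
# using the classic Zermelo/MM fixed-point update.
strength = {m: 1.0 for m in models}
for _ in range(200):
    updated = {}
    for i in models:
        denom = sum(
            pairs[frozenset((i, j))] / (strength[i] + strength[j])
            for j in models
            if j != i and frozenset((i, j)) in pairs
        )
        updated[i] = wins[i] / denom if denom else strength[i]
    # Strengths are identifiable only up to scale, so renormalize each pass.
    total = sum(updated.values())
    strength = {m: p * len(models) / total for m, p in updated.items()}

# One plausible 0-100 mapping: min-max scale the log-strengths.
logs = {m: math.log(p) for m, p in strength.items()}
lo, hi = min(logs.values()), max(logs.values())
scores = {m: round(100 * (logs[m] - lo) / (hi - lo), 1) for m in models}
print(scores)  # best model maps to 100.0, worst to 0.0
```

The fixed-point update converges whenever every model can be connected to every other through a chain of wins, which is why Bradley-Terry fits are a standard starting point for arena-style leaderboards.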
olmOCR is a fixed-question PDF-to-markdown benchmark — narrow and reproducible. Vision Arena is open-ended visual Q&A on live user images — broader and more chaotic. Use olmOCR for document workflows and Vision Arena for general image understanding.
Vision Arena is one of the few benchmarks that tests open-ended visual reasoning rather than narrow tasks like image-classification accuracy. Pair it with Document Arena for PDF and screenshot workflows.