Head-to-head ranking for models that read PDFs, slides, and long screenshots to answer real questions.
Document Arena scores how well models read and reason over real documents. A user uploads a PDF, slide deck, or long screenshot and asks a question that must be answered from its contents. Two anonymous models answer; voters pick the better one. The leaderboard rewards layout-aware reading: pulling the right table cell, the right footnote, the right page out of a long document.
Each comparison is anonymous. Both models receive the same document and question, then produce answers. Wins and losses feed a Bradley-Terry rating that we normalize to 0–100.
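The fitting procedure isn't spelled out here, so below is a minimal sketch of how a Bradley-Terry rating can be fit from pairwise vote tallies and rescaled to 0–100. The minorization-maximization update (Hunter, 2004) and the min-max rescaling of log-strengths are assumptions, and every name in the snippet is illustrative rather than the arena's actual code.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200, tol: float = 1e-8) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win matrix.

    wins[i, j] = number of votes where model i beat model j.
    Assumes every model has at least one win and one loss and the
    comparison graph is connected (standard identifiability conditions).
    """
    n = wins.shape[0]
    games = wins + wins.T          # total comparisons between each pair
    total_wins = wins.sum(axis=1)  # W_i: total votes won by model i
    p = np.ones(n)                 # start all models at equal strength
    for _ in range(iters):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)
        p_new = total_wins / denom.sum(axis=1)
        p_new /= p_new.sum()       # strengths are relative; fix the scale
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p

def to_display_scale(p: np.ndarray) -> np.ndarray:
    """Min-max normalize log-strengths to a 0-100 display score."""
    s = np.log(p)
    return 100.0 * (s - s.min()) / (s.max() - s.min())

# Hypothetical tallies for three models (rows beat columns):
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]])
print(to_display_scale(bradley_terry(wins)))  # strongest model prints 100, weakest 0
```

One consequence of this kind of normalization: scores are relative to the current pool, so a model's number can shift when new models enter the board even if its own answers haven't changed.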
No scores yet for this benchmark.
olmOCR scores literal document-to-markdown conversion on a fixed set of PDFs with unit-test grading. Document Arena scores open-ended question answering over user-supplied documents with human voting. olmOCR is sharp and reproducible; Document Arena is broad and reflective of real use.
If your workflow is PDFs, contracts, financial reports, or slide decks: Document Arena. If it is photos, diagrams, screenshots of UI, or open-ended visual Q&A: Vision Arena. Their scores correlate across our database, but they capture different real-world failure modes.