GDPval (AA): GDPval Agent Benchmark (Artificial Analysis scoring)

Name: GDPval (AA): GDPval Agent Benchmark (Artificial Analysis scoring)
Creator: Mercor, scored by Artificial Analysis
Published: 2025
Keywords: GDPval (AA), AI benchmark, text model evaluation, Mercor, scored by Artificial Analysis

Agent benchmark covering economically valuable knowledge-work tasks across professions.

Open Dataset

Models Tested

Top Score: —
Published: 2025
Source: Mercor, scored by Artificial Analysis

How It Works

GDPval measures whether an AI agent can complete tasks from real knowledge-work jobs end to end. Each task is drawn from a real profession and rated by domain experts on whether the agent’s output would be usable in production. It is meant as a close public proxy for the question "Can this model do my white-collar work?"

The agent receives a brief and a set of tools. It produces an artifact (such as a memo, spreadsheet, slide deck, or contract), which domain experts score against a rubric. Artificial Analysis runs the harness and reports the score.
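The rubric step can be illustrated with a minimal sketch. The actual GDPval rubric format, criteria, weights, and aggregation are not described on this page, so everything below (`RubricItem`, `score_artifact`, `benchmark_score`, the criterion names, and the weighted-average rule) is a hypothetical assumption chosen only to show the general shape of rubric-based scoring.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class RubricItem:
    """One hypothetical rubric criterion with a relative weight."""
    name: str
    weight: float


def score_artifact(rubric: list[RubricItem], ratings: dict[str, float]) -> float:
    """Weighted average of one expert's per-criterion ratings, each in [0, 1]."""
    total_weight = sum(item.weight for item in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive number")
    for item in rubric:
        rating = ratings[item.name]  # KeyError if a criterion was not rated
        if not 0.0 <= rating <= 1.0:
            raise ValueError(f"rating for {item.name!r} must be in [0, 1]")
    return sum(item.weight * ratings[item.name] for item in rubric) / total_weight


def benchmark_score(task_scores: list[float]) -> float:
    """Assumed aggregation: unweighted mean of per-task scores."""
    return mean(task_scores)


# Made-up rubric and ratings for a single memo-writing task.
rubric = [
    RubricItem("factually_correct", 2.0),
    RubricItem("complete", 1.0),
    RubricItem("usable_without_rework", 1.0),
]
memo_ratings = {"factually_correct": 1.0, "complete": 0.5, "usable_without_rework": 0.0}
memo_score = score_artifact(rubric, memo_ratings)
print(memo_score)  # 0.625
```

Under these assumptions, a correct but incomplete memo that still needs rework scores 0.625, which shows how a rubric can separate "right" from "ready to use".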

Dataset size: A growing suite of agent tasks drawn from real knowledge-work professions, including legal, finance, sales, and operations.
Mean score: 0.0
Median score: 0.0
Open / Closed: 0 / 0

Top Scorers

No scores yet for this benchmark.

Score Distribution

Not enough scored models yet.

Most Correlated Benchmarks

Not enough scored models yet.

What It Captures Well

Real-world tasks instead of academic puzzles.
Expert scoring catches subtle correctness failures.
Cross-profession breadth.

Where It Falls Short

Expensive to run, so the leaderboard updates slowly.
Expert scoring introduces some subjectivity.
Scaffold-dependent, like all agent benchmarks: the score reflects the harness and tools as well as the model.
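One common way to quantify the subjectivity of expert scoring is to have two experts rate the same artifacts and measure their agreement beyond chance. The sketch below uses Cohen's kappa for that purpose. Whether GDPval or Artificial Analysis reports any agreement statistic is not stated here; the function name and the pass/fail labels are illustrative assumptions.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Fraction of items on which the two raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up verdicts from two hypothetical experts on four artifacts.
expert_1 = ["pass", "pass", "fail", "fail"]
expert_2 = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(expert_1, expert_2))  # 0.5
```

A kappa near 1 would suggest the rubric leaves little room for interpretation, while a value near 0 would mean the experts agree no more often than chance.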

Related Benchmarks

Based on score correlations across our database.

Pearson r —

GPQA

MMLU-PRO

GSM8K

SWE-Verified
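The Pearson r used to rank related benchmarks can be computed from the paired scores of models that were evaluated on both benchmarks. The sketch below shows the standard formula. The function name and the example numbers are made up for illustration; no real scores exist for this benchmark yet.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equally long lists of paired scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centered products and squares; the 1/n factors cancel in r.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined when either list is constant")
    return cov / sqrt(var_x * var_y)


# Made-up scores for four hypothetical models on two benchmarks.
benchmark_a = [0.2, 0.4, 0.6, 0.8]
benchmark_b = [10.0, 20.0, 30.0, 40.0]
r = pearson_r(benchmark_a, benchmark_b)
print(round(r, 6))  # 1.0
```

The function needs at least two models scored on both benchmarks, which is consistent with the page showing no correlation value while no models have been scored.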
