Benchmarks · 2025

APEX Agents (AA): APEX Multi-Step Agent Benchmark (Artificial Analysis scoring)

Name: APEX Agents (AA): APEX Multi-Step Agent Benchmark (Artificial Analysis scoring)
Creator: Artificial Analysis
Published: 2025
Keywords: APEX Agents (AA), AI benchmark, text model evaluation, Artificial Analysis

Multi-step agent benchmark focused on planning and tool use across business workflows.

Open Dataset

Models Tested

Top Score: —
Published: 2025
Source: Artificial Analysis

How It Works

APEX tests whether an agent can plan and execute a multi-step business task using tools. Each task requires the agent to decompose a goal, call the right tools in the right order, and recover when something goes wrong. The benchmark is intended as a clean signal of agent reliability on tasks with non-trivial step counts.
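The plan, execute, and recover loop described above can be illustrated with a toy sketch. Everything in it is hypothetical and is not taken from the APEX harness: the refund workflow, the `lookup_order` and `FlakyRefund` tools, the `run_plan` helper, and the retry policy are assumptions chosen only to show the shape of a multi-step task with one transient tool failure.

```python
from typing import Callable

Tool = Callable[[dict], dict]


def lookup_order(state: dict) -> dict:
    """Hypothetical tool: fetch the order total for the order in `state`."""
    return {**state, "order_total": 40.0}


class FlakyRefund:
    """Hypothetical tool that times out on its first call, then succeeds."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, state: dict) -> dict:
        self.calls += 1
        if self.calls == 1:
            raise TimeoutError("refund service timed out")
        return {**state, "refunded": state["order_total"]}


def run_plan(plan: list[Tool], state: dict, max_retries: int = 2) -> tuple[dict, int]:
    """Run tools in order, retrying a failed step up to `max_retries` times.

    Returns the final state and the number of tool calls attempted.
    """
    steps_taken = 0
    for tool in plan:
        for attempt in range(max_retries + 1):
            steps_taken += 1
            try:
                state = tool(state)
                break  # step succeeded, move on to the next tool
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: the task fails


    return state, steps_taken


def verify(state: dict) -> bool:
    """Check only the final outcome: the full order total was refunded."""
    return "refunded" in state and state["refunded"] == state.get("order_total")


final_state, steps = run_plan([lookup_order, FlakyRefund()], {"order_id": "A-1"})
print(verify(final_state), steps)  # True 3
```

The refund step fails once and is retried, so the task passes after three tool calls rather than two. A real agent would choose the plan itself instead of receiving it as a fixed list.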

Each task has a verifier that checks the final outcome. The score is the task pass rate, that is, the fraction of tasks whose final outcome the verifier accepts, with breakdowns by step count and tool category to show where models fail.
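As a rough illustration of this scoring scheme, the sketch below computes an overall pass rate and the two breakdowns from a list of per-task results. The `TaskResult` fields, the step-count buckets, the tool category names, and the sample data are all assumptions made for the example; the benchmark's actual result schema and bucket boundaries are not described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass
class TaskResult:
    task_id: str
    step_count: int      # assumed: number of steps the task requires
    tool_category: str   # assumed: coarse label for the tools involved
    passed: bool         # verdict of the task's verifier


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks whose verifier accepted the final outcome."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def breakdown(results: list[TaskResult], key: Callable[[TaskResult], Hashable]) -> dict:
    """Pass rate per group, where `key` maps a result to its group label."""
    groups = defaultdict(list)
    for r in results:
        groups[key(r)].append(r)
    return {label: pass_rate(group) for label, group in sorted(groups.items())}


def step_bucket(r: TaskResult) -> str:
    # Hypothetical bucket boundary, chosen only for illustration.
    return "1-5 steps" if r.step_count <= 5 else "6+ steps"


# Made-up results for demonstration; not real benchmark data.
results = [
    TaskResult("t1", 3, "crm", True),
    TaskResult("t2", 4, "spreadsheet", True),
    TaskResult("t3", 8, "crm", False),
    TaskResult("t4", 12, "spreadsheet", True),
]

print(pass_rate(results))                              # 0.75
print(breakdown(results, step_bucket))                 # {'1-5 steps': 1.0, '6+ steps': 0.5}
print(breakdown(results, lambda r: r.tool_category))   # {'crm': 0.5, 'spreadsheet': 1.0}
```

In this made-up sample the overall pass rate hides the drop from 1.0 to 0.5 on longer tasks, which is the kind of pattern a step-count breakdown is meant to expose.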

Dataset size: A suite of multi-step agent tasks that combine tool use, planning, and recovery from errors.
Mean score: 0.0
Median score: 0.0
Open / Closed: 0 / 0

Top Scorers

No scores yet for this benchmark.

Score Distribution

Not enough scored models yet.

Most Correlated Benchmarks

Not enough scored models yet.

What It Captures Well

Tests planning, tool use, and error recovery jointly.
Step-count breakdowns reveal where reliability drops.
Run by one team under a shared harness.

Where It Falls Short

Scaffold-dependent.
Closed methodology.
Resource-intensive to run.

Related Benchmarks

Based on score correlations across our database.

Pearson r —
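For reference, Pearson r between two benchmarks can be computed from the scores of the models that appear on both. The sketch below is a plain implementation of the standard formula, not the site's own code; the `pearson_r` helper and the score lists are hypothetical. It also shows why no value can be reported when fewer than two models are scored.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between paired scores of the same models."""
    if len(xs) != len(ys):
        raise ValueError("score lists must be paired model by model")
    if len(xs) < 2:
        raise ValueError("need at least two scored models")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for four models on two benchmarks; not real data.
bench_a = [0.2, 0.4, 0.6, 0.8]
bench_b = [0.1, 0.3, 0.5, 0.7]
print(round(pearson_r(bench_a, bench_b), 6))        # 1.0
print(round(pearson_r(bench_a, bench_b[::-1]), 6))  # -1.0
```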

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.


