Multi-step agent benchmark focused on planning and tool use across business workflows.
APEX tests whether an agent can plan and execute a multi-step business task using tools. Each task requires the agent to decompose a goal, call the right tools in the right order, and recover when a step fails. Because tasks span many dependent steps, the benchmark gives a clear signal of agent reliability at non-trivial step counts.
Each task has a verifier that checks the final outcome. The score is the task pass rate, broken down by step count and tool category to show where models fail.
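The scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TaskResult` record, its field names, the helper functions, and the sample rows are all hypothetical and are not APEX's actual schema or results. It also assumes each task is labeled with a single tool category.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical record; field names are illustrative, not APEX's schema.
    task_id: str
    step_count: int
    tool_category: str  # assumption: one category label per task
    passed: bool        # verdict of the task's final-outcome verifier


def pass_rate(results):
    """Fraction of tasks whose verifier passed; None when there are no results."""
    results = list(results)
    if not results:
        return None
    return sum(r.passed for r in results) / len(results)


def pass_rate_by(results, key):
    """Group results by key(result) and compute the pass rate of each group."""
    groups = defaultdict(list)
    for r in results:
        groups[key(r)].append(r)
    return {k: pass_rate(v) for k, v in sorted(groups.items())}


# Made-up sample rows, for illustration only (not benchmark data).
results = [
    TaskResult("t1", 3, "spreadsheet", True),
    TaskResult("t2", 3, "email", False),
    TaskResult("t3", 8, "spreadsheet", True),
    TaskResult("t4", 8, "email", False),
]

overall = pass_rate(results)
by_steps = pass_rate_by(results, lambda r: r.step_count)
by_tool = pass_rate_by(results, lambda r: r.tool_category)
```

In this toy data the overall pass rate and both step-count groups sit at 0.5, while the tool-category split separates the failures cleanly (every "email" task fails, every "spreadsheet" task passes). That is the kind of pattern the breakdowns are meant to expose.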
No scores yet for this benchmark.
Not enough scored models yet.