Agent benchmark covering economically valuable knowledge-work tasks across professions.
GDPval measures whether an AI agent can complete tasks from real knowledge-work jobs end-to-end. Each task is drawn from a real profession, and domain experts rate whether the agent’s output would be usable in production. It is the closest public proxy for the question "can this model do my white-collar work?"
The agent receives a brief and a set of tools. It produces an artifact (such as a memo, spreadsheet, slide deck, or contract), which domain experts score against a rubric. Artificial Analysis runs the harness and reports the score.
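To make the rubric step concrete, here is a minimal sketch of how expert judgments on a single artifact might be combined into one score. The page does not say how rubric items are weighted or aggregated, so everything here is an assumption for illustration: the `Criterion` type, the weights, the example criteria, and the weighted-fraction formula are all hypothetical, not the benchmark's actual method.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line as judged by a domain expert (hypothetical structure)."""
    description: str
    weight: float
    met: bool


def rubric_score(criteria: list[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria the artifact satisfies.

    Assumption: a simple weighted average. The real aggregation may differ.
    """
    total = sum(c.weight for c in criteria)
    if total == 0:
        raise ValueError("rubric has no weighted criteria")
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


# Illustrative judgments for a memo-style artifact (made-up criteria).
example = [
    Criterion("Memo states the recommendation up front", 2.0, True),
    Criterion("Figures reconcile with the source spreadsheet", 3.0, False),
    Criterion("Tone is appropriate for a client", 1.0, True),
]
score = rubric_score(example)
```

With these made-up judgments the artifact earns 3 of 6 weighted points. A benchmark-level number would then come from aggregating such per-task scores, but how that is done is not described here.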
No scores yet for this benchmark.