A customer-service simulation where an agent has to follow airline policy and use tools to actually resolve a request.
Tau-Bench drops the agent into a realistic airline support chat. A simulated customer asks to change or cancel a booking, and the agent has to call the right tools, follow the airline's written policy, and only make changes that are actually allowed. It measures whether an agent can stay on the rails over a multi-turn conversation, not just answer one question.
The agent talks to a simulated user and a set of tools backed by a database. A task counts as a success only if the database ends in the correct state and no policy rule was broken. Scores report the success rate across tasks, and a stricter "pass^k" variant asks whether the agent succeeds on all k repeated attempts at the same task, not just on one.
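The success criterion described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual grader: `EpisodeResult`, `task_succeeded`, `success_rate`, and the reservation records are all hypothetical names and data made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeResult:
    """Hypothetical record of one finished conversation (not a real Tau-Bench type)."""
    final_db: dict                                   # database state after the agent's tool calls
    policy_violations: list = field(default_factory=list)  # rules broken, if any


def task_succeeded(result: EpisodeResult, expected_db: dict) -> bool:
    """A task passes only if the end state matches and no rule was broken."""
    return result.final_db == expected_db and not result.policy_violations


def success_rate(outcomes: list) -> float:
    """Percentage of tasks that passed."""
    return 100.0 * sum(outcomes) / len(outcomes)


# Made-up example: the customer asked to cancel reservation "RES1".
expected = {"RES1": {"status": "cancelled"}}

correct = EpisodeResult(final_db={"RES1": {"status": "cancelled"}})
wrong_state = EpisodeResult(final_db={"RES1": {"status": "active"}})
rule_broken = EpisodeResult(
    final_db={"RES1": {"status": "cancelled"}},
    policy_violations=["cancelled a non-refundable fare"],
)

outcomes = [task_succeeded(r, expected) for r in (correct, wrong_state, rule_broken)]
print(outcomes)
print(success_rate(outcomes))
```

Note that the third episode fails even though the database looks right: under this criterion, reaching the correct end state through a forbidden action still counts as a failure.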
| # | Agent System | Model | Success rate (%) |
|---|---|---|---|
| 01 | HAL Generalist Agent | Claude-3.7 Sonnet (February 2025) | 56.0 |
| 02 | TAU-bench Tool Calling | o4-mini High (April 2025) | 56.0 |
| 03 | HAL Generalist Agent | Claude Opus 4.1 (August 2025) | 54.0 |
| 04 | TAU-bench Tool Calling | o3 Medium (April 2025) | 54.0 |
| 05 | TAU-bench Tool Calling | Claude Opus 4.1 High (August 2025) | 52.0 |
| 06 | TAU-bench Tool Calling | Claude-3.7 Sonnet High (February 2025) | 52.0 |
| 07 | TAU-bench Tool Calling | Claude Opus 4.1 (August 2025) | 50.0 |
| 08 | TAU-bench Tool Calling | GPT-5 Medium (August 2025) | 48.0 |
| 09 | HAL Generalist Agent | Claude Opus 4 High (May 2025) | 44.0 |
| 10 | HAL Generalist Agent | Claude Opus 4 (May 2025) | 44.0 |
| 11 | HAL Generalist Agent | Claude-3.7 Sonnet High (February 2025) | 44.0 |
| 12 | TAU-bench Tool Calling | Claude-3.7 Sonnet (February 2025) | 44.0 |
| 13 | TAU-bench Tool Calling | DeepSeek V3 (March 2025) | 44.0 |
| 14 | TAU-bench Tool Calling | DeepSeek R1 (January 2025) | 36.0 |
| 15 | TAU-bench Tool Calling | GPT-4.1 (April 2025) | 36.0 |
| 16 | TAU-bench Tool Calling | o4-mini Low (April 2025) | 36.0 |
| 17 | HAL Generalist Agent | Claude Opus 4.1 High (August 2025) | 32.0 |
| 18 | HAL Generalist Agent | GPT-5 Medium (August 2025) | 30.0 |
| 19 | TAU-bench Tool Calling | Gemini 2.0 Flash High (February 2025) | 28.0 |
| 20 | HAL Generalist Agent | o4-mini Low (April 2025) | 22.0 |
| 21 | HAL Generalist Agent | Gemini 2.0 Flash (February 2025) | 22.0 |
| 22 | HAL Generalist Agent | o3 Medium (April 2025) | 20.0 |
| 23 | HAL Generalist Agent | o4-mini High (April 2025) | 18.0 |
| 24 | HAL Generalist Agent | DeepSeek V3 (March 2025) | 18.0 |
| 25 | HAL Generalist Agent | GPT-4.1 (April 2025) | 16.0 |
Tau-Bench punishes any rule violation, and staying consistent across many turns is hard. Strong systems often land in the 40–60% range, and the reliability-focused pass^k numbers are lower still.
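To see why pass^k can only be equal to or lower than the plain success rate, it helps to compute it on a small example. The sketch below assumes each task is attempted `n_trials` times and estimates pass^k per task as C(c, k) / C(n, k), where c is the number of successful attempts, then averages over tasks. That estimator is an assumption about how the metric is computed, the function name `pass_hat_k` is hypothetical, and the success counts are made up.

```python
from math import comb


def pass_hat_k(successes_per_task, n_trials, k):
    """Estimate pass^k: the chance that k attempts drawn from n_trials all succeed.

    successes_per_task holds, for each task, how many of its n_trials
    attempts succeeded. This estimator is an assumption for illustration.
    """
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    total = comb(n_trials, k)
    # comb(c, k) is 0 when c < k, so a task with too few successes contributes 0.
    per_task = [comb(c, k) / total for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Made-up data: four tasks, four attempts each.
successes = [4, 2, 0, 4]

for k in (1, 2, 4):
    print(k, round(pass_hat_k(successes, n_trials=4, k=k), 4))
# prints:
# 1 0.625
# 2 0.5417
# 4 0.5
```

With k = 1 the estimate is just the average success rate. As k grows, the task that succeeded only two times out of four contributes less and eventually nothing, which is why an agent that is right "about half the time" on a task gets little credit under the stricter metric.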