Agent Benchmark · 2024

Tau-Bench Airline: τ-bench Airline

Name: Tau-Bench Airline: τ-bench Airline
Creator: Sierra
Published: 2024
Keywords: Tau-Bench Airline, AI agent benchmark, Sierra

A customer-service simulation where an agent has to follow airline policy and use tools to actually resolve a request.

Systems Ranked: 25
Top Score: 56.0
Published: 2024
Source: Sierra

How It Works

Tau-Bench drops the agent into a realistic airline support chat. A simulated customer asks to change or cancel a booking, and the agent has to call the right tools, follow the airline's written policy, and only make changes that are actually allowed. It measures whether an agent can stay on the rails over a multi-turn conversation, not just answer one question.

The agent talks to a simulated user and a set of tools backed by a database. Success means the database ends in the correct state and no policy rule was broken. Scores report the success rate, and a stricter "pass^k" variant checks whether the agent succeeds reliably across repeated attempts.
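The scoring rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: the `Episode` record, its field names, and the function names are hypothetical. The pass^k estimate shown here treats each task as n recorded attempts with c successes and asks how likely it is that k attempts drawn from those n all succeeded, then averages over tasks.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class Episode:
    """One attempt at one task (hypothetical record, not the benchmark's schema)."""
    final_db: dict          # database state after the conversation ends
    expected_db: dict       # state the task's ground truth requires
    policy_violations: int  # number of policy rules the agent broke


def episode_succeeded(ep: Episode) -> bool:
    # Success needs both: the correct end state and zero policy violations.
    return ep.final_db == ep.expected_db and ep.policy_violations == 0


def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Estimate pass^k from per-task success counts over n_trials attempts each."""
    if not successes_per_task:
        raise ValueError("need at least one task")
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    # comb(c, k) is 0 when c < k, so a task with too few successes contributes 0.
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Three made-up tasks, four attempts each, with 4, 2, and 0 successes.
counts = [4, 2, 0]
print(pass_hat_k(counts, n_trials=4, k=1))  # plain success rate
print(pass_hat_k(counts, n_trials=4, k=2))  # stricter: two attempts must both pass
```

With k = 1 the estimate reduces to the ordinary success rate. Raising k only rewards tasks the agent solves consistently, which is why the second number is lower than the first.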

Dataset size: A set of airline support scenarios with policy rules.
Agent type: Tool Calling
Published by: Sierra
Year: 2024

Top Agent Systems

#	Agent System	Model	Score (%)
01	HAL Generalist Agent	Claude-3.7 Sonnet (February 2025)	56.0
02	TAU-bench Tool Calling	o4-mini High (April 2025)	56.0
03	HAL Generalist Agent	Claude Opus 4.1 (August 2025)	54.0
04	TAU-bench Tool Calling	o3 Medium (April 2025)	54.0
05	TAU-bench Tool Calling	Claude Opus 4.1 High (August 2025)	52.0
06	TAU-bench Tool Calling	Claude-3.7 Sonnet High (February 2025)	52.0
07	TAU-bench Tool Calling	Claude Opus 4.1 (August 2025)	50.0
08	TAU-bench Tool Calling	GPT-5 Medium (August 2025)	48.0
09	HAL Generalist Agent	Claude Opus 4 High (May 2025)	44.0
10	HAL Generalist Agent	Claude Opus 4 (May 2025)	44.0
11	HAL Generalist Agent	Claude-3.7 Sonnet High (February 2025)	44.0
12	TAU-bench Tool Calling	Claude-3.7 Sonnet (February 2025)	44.0
13	TAU-bench Tool Calling	DeepSeek V3 (March 2025)	44.0
14	TAU-bench Tool Calling	DeepSeek R1 (January 2025)	36.0
15	TAU-bench Tool Calling	GPT-4.1 (April 2025)	36.0
16	TAU-bench Tool Calling	o4-mini Low (April 2025)	36.0
17	HAL Generalist Agent	Claude Opus 4.1 High (August 2025)	32.0
18	HAL Generalist Agent	GPT-5 Medium (August 2025)	30.0
19	TAU-bench Tool Calling	Gemini 2.0 Flash High (February 2025)	28.0
20	HAL Generalist Agent	o4-mini Low (April 2025)	22.0
21	HAL Generalist Agent	Gemini 2.0 Flash (February 2025)	22.0
22	HAL Generalist Agent	o3 Medium (April 2025)	20.0
23	HAL Generalist Agent	o4-mini High (April 2025)	18.0
24	HAL Generalist Agent	DeepSeek V3 (March 2025)	18.0
25	HAL Generalist Agent	GPT-4.1 (April 2025)	16.0

Strengths

Tests reliability over a full conversation, not a single reply.
Built-in policy rules catch agents that take unsafe shortcuts.
Mirrors real support automation, a common business use case.

Limitations

Narrow domain, so it does not test general capability.
The simulated user can behave differently from real customers.
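Tool and policy design, along with the agent scaffold, influence scores as much as the model: on this leaderboard, o4-mini High scores 56.0 under TAU-bench Tool Calling but 18.0 under the HAL Generalist Agent.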

Frequently Asked Questions

Why do airline scores look low compared to coding benchmarks?

Tau-Bench punishes any rule violation, and staying consistent across many turns is hard. Strong systems often land in the 40–60% range, and the reliability-focused pass^k numbers are lower still.
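A rough piece of arithmetic shows why the reliability numbers fall so quickly. The sketch below assumes, purely for illustration, that every attempt succeeds independently with the same probability p. Real tasks vary in difficulty, so measured pass^k values will not follow this curve exactly. The function name is hypothetical.

```python
def all_k_succeed(p: float, k: int) -> float:
    """Probability that k independent attempts all succeed, each with rate p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return p ** k


# Start from the top single-attempt score on this leaderboard (56%).
for k in (1, 2, 4, 8):
    print(f"k={k}: {all_k_succeed(0.56, k):.3f}")
```

Under this simplified assumption, a 56% single-attempt success rate drops to just under 10% once four attempts in a row must all succeed.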

Other Agent Benchmarks

Browse the other benchmarks on the leaderboard.

Research

GAIA

Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.

Coding

SWE-bench Verified

Real GitHub issues an agent has to fix by editing a codebase until the project test suite passes.

Coding

SWE-bench Verified Mini

A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

Browser

AssistantBench

Time-consuming, realistic web tasks that require browsing many live pages to find one answer.
