Conversational agent benchmark in the telecom customer-support domain, where context and turn-taking matter.
Tau-2 Bench Telecom tests agents on multi-turn customer-support conversations. The agent has to gather information from the customer, query backend systems, follow company policies, and resolve the issue over the course of the conversation. Telecom is the hardest of the benchmark's domains because its policies are complex and customers often miscommunicate their intent.
Each task pairs the agent with a simulated customer. To pass, the agent must use the right tools, follow policies, and arrive at the correct resolution. Task outcomes are verifiable and scored as pass or fail.
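The pass-or-fail scoring described above can be illustrated with a minimal sketch. This is not the benchmark's actual harness or API: `Task`, `run_episode`, `score`, `pass_rate`, and `toy_agent` are all hypothetical names, the customer is a fixed script rather than a simulated one, and a pass is assumed to mean that the backend state matches an expected end state.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record (not the benchmark's real schema)."""
    task_id: str
    initial_state: dict      # backend state before the conversation
    expected_state: dict     # fields that must hold afterwards for a pass
    customer_script: list    # scripted customer messages, one per turn


# An agent maps (customer message, mutable backend state) to a reply.
Agent = Callable[[str, dict], str]


def run_episode(task: Task, agent: Agent, max_turns: int = 10) -> dict:
    """Play the scripted customer against the agent and return the final state."""
    state = dict(task.initial_state)  # copy so the task itself is not mutated
    for message in task.customer_script[:max_turns]:
        agent(message, state)  # the agent may change backend state ("tool use")
    return state


def score(task: Task, final_state: dict) -> int:
    """Return 1 if every expected field matches the final state, else 0."""
    return int(all(final_state.get(k) == v for k, v in task.expected_state.items()))


def pass_rate(tasks: list, agent: Agent) -> float:
    """Fraction of tasks the agent passes; 0.0 for an empty task list."""
    if not tasks:
        return 0.0
    return sum(score(t, run_episode(t, agent)) for t in tasks) / len(tasks)


def toy_agent(message: str, state: dict) -> str:
    # Toy policy: only enable roaming when the customer explicitly asks for it.
    if "roaming" in message.lower():
        state["roaming_enabled"] = True
        return "Roaming is now enabled."
    return "Could you tell me more about the issue?"


tasks = [
    Task("t1", {"roaming_enabled": False}, {"roaming_enabled": True},
         ["My phone has no data abroad.", "Please turn on roaming."]),
    Task("t2", {"line_suspended": True}, {"line_suspended": False},
         ["My line stopped working."]),
]

rate = pass_rate(tasks, toy_agent)
print(f"pass rate: {rate:.2f}")
```

In this toy setup the agent resolves the first task only after the customer clarifies the request in the second turn, and it fails the second task because it never changes the line status. Scoring on the end state rather than on the wording of the replies is what makes the outcome verifiable.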
No scores yet for this benchmark.
Not enough scored models yet.
Tau-2 is the second generation of the Tau benchmark, with cleaner tasks, better tooling, and a tougher simulated customer than the original. Most current evaluations report Tau-2 results.