Sixteen thousand earnings-call Q&A pairs that test whether a model can spot when an executive is dodging the question.
EvasionBench is a domain-specific reasoning test built around corporate communication. The model reads an analyst question and an executive answer, then judges whether the answer actually addresses the question. This kind of nuanced reading is core to finance, sales, legal review, and any workflow that turns unstructured talk into structured insight.
Each Q&A pair has a gold label. The model classifies the answer and is scored on accuracy. Some leaderboards also score the model's short justification, which catches cases where the predicted label is right for the wrong reason.
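The scoring loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the field names (`question`, `answer`, `label`), the label vocabulary, and the `judge` function are all assumptions, and a real evaluation would replace `judge` with a model call.

```python
def judge(question: str, answer: str) -> str:
    """Stand-in for the model under test; returns 'evasive' or 'responsive'.

    This trivial keyword heuristic is a placeholder only -- a real
    evaluation prompts an LLM with the question/answer pair here.
    """
    return "responsive" if "guidance" in answer.lower() else "evasive"


def score(pairs: list[dict]) -> float:
    """Accuracy: fraction of pairs where the predicted label matches gold."""
    correct = sum(
        judge(p["question"], p["answer"]) == p["label"] for p in pairs
    )
    return correct / len(pairs)


# Hypothetical examples in the assumed schema.
pairs = [
    {
        "question": "What is your margin outlook?",
        "answer": "We don't give guidance on that metric.",
        "label": "responsive",
    },
    {
        "question": "Will you hit the Q3 target?",
        "answer": "We're excited about our long-term strategy.",
        "label": "evasive",
    },
]

print(score(pairs))  # prints 1.0
```

Scoring on the justification as well (the second signal mentioned above) would add a second check per pair, typically graded by an LLM judge rather than string matching.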
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | GLM-4.7 | Z.ai | Open | 82.9 |
| 02 | DeepSeek-V3.2 | DeepSeek | Open | 66.9 |
| 03 | Kimi K2 Instruct 0905 | Moonshot AI | Open | 66.7 |
This benchmark matters to anyone shipping AI into finance, legal, sales-call analysis, or compliance. The score is a strong signal for whether a model can read between the lines of formal business language.
Domain expertise helps, but is not required. The strongest performers also score well on general reasoning benchmarks: EvasionBench is a downstream stress test, not a domain knowledge test.
Based on score correlations across our database.