Sixteen thousand earnings-call Q&A pairs that test whether a model can spot when an executive is dodging the question.
EvasionBench is a domain-specific reasoning test built around corporate communication. The model reads an analyst question and an executive answer, then judges whether the answer actually addresses the question. This kind of nuanced reading is core to finance, sales, legal review, and any workflow that turns unstructured talk into structured insight.
Each Q&A pair has a gold label. The model classifies the answer and is scored on accuracy. Some leaderboards also score the model's short justification, which catches cases where the predicted label is right for the wrong reason.
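The scoring loop described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the field names (`question`, `answer`, `label`), the label vocabulary, and the `judge` function are all assumptions, and a real evaluation would replace `judge` with a model call.

```python
def judge(question: str, answer: str) -> str:
    """Stand-in for the model under test; returns 'evasive' or 'responsive'.

    This trivial keyword heuristic is a placeholder only -- a real
    evaluation prompts an LLM with the question/answer pair here.
    """
    return "responsive" if "guidance" in answer.lower() else "evasive"


def score(pairs: list[dict]) -> float:
    """Accuracy: fraction of pairs where the predicted label matches gold."""
    correct = sum(
        judge(p["question"], p["answer"]) == p["label"] for p in pairs
    )
    return correct / len(pairs)


# Hypothetical examples in the assumed schema.
pairs = [
    {
        "question": "What is your margin outlook?",
        "answer": "We don't give guidance on that metric.",
        "label": "responsive",
    },
    {
        "question": "Will you hit the Q3 target?",
        "answer": "We're excited about our long-term strategy.",
        "label": "evasive",
    },
]

print(score(pairs))  # prints 1.0
```

Scoring on the justification as well (the second signal mentioned above) would add a second check per pair, typically graded by an LLM judge rather than string matching.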
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | GLM-4.7 | Z.ai | Open | 82.9 |
| 02 | DeepSeek-V3.2 | DeepSeek | Open | 66.9 |
| 03 | Kimi K2 Instruct 0905 | Moonshot AI | Open | 66.7 |
This benchmark matters to anyone shipping AI into finance, legal, sales-call analysis, or compliance. The score is a strong signal for whether a model can read between the lines of formal business language.
Domain expertise helps, but is not required. The strongest performers also score well on general reasoning benchmarks: EvasionBench is a downstream stress test, not a domain knowledge test.
Based on score correlations across our database.