Benchmarks · 2025

Terminal Bench Hard: Terminal-Bench Hard Subset

Name: Terminal Bench Hard: Terminal-Bench Hard Subset
Creator: Harbor Framework, scored by Artificial Analysis
Published: 2025
Keywords: Terminal Bench Hard, AI benchmark, text model evaluation, Harbor Framework, scored by Artificial Analysis

The harder tier of Terminal Bench, scored by Artificial Analysis as an agent stress test.

Open Dataset

Models Tested: 126
Top Score: 62.9
Published: 2025
Source: Harbor Framework, scored by Artificial Analysis

How It Works

Terminal Bench Hard is the harder slice of Terminal Bench. Its tasks demand more planning, deeper error recovery, and longer chains of tool calls in a real Linux sandbox. It is a stress test for agent harnesses and the closest public proxy for the question "can this model handle a multi-hour DevOps task?"

Evaluation works the same way as in Terminal Bench: each task has a verifier script that checks the final filesystem state. The Hard subset uses tighter budgets and harder problems. The score is the task pass rate, that is, the share of tasks whose verifier passes.
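
The scoring idea described above can be sketched in a few lines. This is a minimal illustration and not the Harbor Framework or Artificial Analysis harness: the `Verifier` type, the `verify_config_written` check, the `app.conf` file, and the `run_demo` helper are all hypothetical names invented for the example. It assumes a verifier is simply a function that inspects the sandbox directory after the agent stops and returns pass or fail.

```python
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical verifier type: inspects the sandbox root after the agent
# finishes and reports whether the final filesystem state is acceptable.
Verifier = Callable[[Path], bool]


def verify_config_written(root: Path) -> bool:
    """Illustrative check: the task asked for app.conf containing 'port=8080'."""
    target = root / "app.conf"
    return target.is_file() and "port=8080" in target.read_text()


def pass_rate(results: list[bool]) -> float:
    """Percentage of tasks whose verifier passed."""
    if not results:
        raise ValueError("no task results to score")
    return 100.0 * sum(results) / len(results)


def run_demo() -> float:
    """Score two simulated tasks: one solved, one left untouched."""
    results: list[bool] = []
    # Task 1: the simulated agent produced the expected file.
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "app.conf").write_text("port=8080\n")
        results.append(verify_config_written(root))
    # Task 2: the simulated agent changed nothing, so the check fails.
    with tempfile.TemporaryDirectory() as d:
        results.append(verify_config_written(Path(d)))
    return pass_rate(results)


if __name__ == "__main__":
    print(f"pass rate: {run_demo():.1f}")
```

Because only the end state is checked, two agents that take very different command sequences receive the same credit as long as the sandbox ends up in the expected state.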

Dataset size: A curated subset of Terminal Bench tasks selected for higher difficulty and longer step counts.
Mean score: 27.4
Median score: 29.5
Open / Closed models: 53 / 73

Top Scorers

#	Model	Lab	Source	Score
01	Claude Fable 5	Anthropic	Closed	62.9
02	GPT-5.5	OpenAI	Closed	60.6
03	GPT-5.4	OpenAI	Closed	57.6
04	GPT-5.4 High	OpenAI	Closed	57.6
05	Gemini 3.1 Pro Preview	Google	Closed	53.8
06	GPT-5.4 Mini High	OpenAI	Closed	52.3
07	Claude Opus 4.7 Thinking	Anthropic	Closed	51.5
08	Claude Opus 4.7	Anthropic	Closed	51.5
09	GLM-5.2	Z.ai	Open	50.8
10	Qwen 3.7 Max	Alibaba	Closed	50.8
11	Claude Opus 4.6	Anthropic	Closed	48.5
12	GPT-5.2	OpenAI	Closed	47.0
13	GPT-5.2 High	OpenAI	Closed	47.0
14	Claude Opus 4.5 (Thinking 32K)	Anthropic	Closed	47.0
15	DeepSeek-V4-Pro	DeepSeek	Open	46.2

Score Distribution

Open vs Closed Source

Gap on Terminal Bench Hard: +12.1 pts (closed leads)

Top Open-Source Models

#	Model	Score
1	GLM-5.2	50.8
2	DeepSeek-V4-Pro	46.2
3	Kimi K2.7 Code	44.7

Top Closed-Source Models

#	Model	Score
1	Claude Fable 5	62.9
2	GPT-5.5	60.6
3	GPT-5.4	57.6
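
The page does not say how the open/closed gap is defined, but the reported 12.1 points equals the difference between the best closed score and the best open score listed in this section (62.9 and 50.8). Assuming that definition, a short sketch of the calculation follows; the `best` and `open_closed_gap` helpers are illustrative names, not part of any published tooling.

```python
# Scores copied from the top open-source and closed-source lists in this section.
TOP_OPEN = {"GLM-5.2": 50.8, "DeepSeek-V4-Pro": 46.2, "Kimi K2.7 Code": 44.7}
TOP_CLOSED = {"Claude Fable 5": 62.9, "GPT-5.5": 60.6, "GPT-5.4": 57.6}


def best(scores: dict[str, float]) -> tuple[str, float]:
    """Return the (model, score) pair with the highest score."""
    return max(scores.items(), key=lambda item: item[1])


def open_closed_gap(closed: dict[str, float], open_: dict[str, float]) -> float:
    """Best closed score minus best open score, rounded to one decimal.

    Assumption: the gap compares the single best model on each side.
    A positive value means the closed side leads.
    """
    return round(best(closed)[1] - best(open_)[1], 1)


if __name__ == "__main__":
    print(best(TOP_CLOSED), best(TOP_OPEN))
    print(open_closed_gap(TOP_CLOSED, TOP_OPEN))
```

A best-versus-best gap is sensitive to a single model release on either side, so it can move sharply between updates even when the rest of the field is unchanged.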

Score vs Parameter Count

71 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average score	n
Xiaomi	41.9	3
Anthropic	39.5	16
MiniMax	38.9	3
Z.ai	36.6	6
Moonshot AI	32.3	6
OpenAI	31.2	22
Alibaba	29.8	14
DeepSeek	26.9	6
xAI	26.5	9
Google	20.4	19
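
Lab averages rest on very different sample sizes (n = 3 up to n = 22), so they should not be averaged together as if each lab counted equally. The sketch below combines the per-lab figures above into a pooled mean weighted by n and compares it with the unweighted mean of the ten averages. The function names are illustrative. Two caveats apply: the inputs are already rounded to one decimal, and the ten labs listed cover 104 of the 126 models, so the pooled value is not expected to reproduce the overall mean of 27.4.

```python
# (lab, average score, number of models) copied from the per-lab figures above.
LAB_AVERAGES = [
    ("Xiaomi", 41.9, 3),
    ("Anthropic", 39.5, 16),
    ("MiniMax", 38.9, 3),
    ("Z.ai", 36.6, 6),
    ("Moonshot AI", 32.3, 6),
    ("OpenAI", 31.2, 22),
    ("Alibaba", 29.8, 14),
    ("DeepSeek", 26.9, 6),
    ("xAI", 26.5, 9),
    ("Google", 20.4, 19),
]


def total_models(rows: list[tuple[str, float, int]]) -> int:
    """Number of models covered by the listed labs."""
    return sum(n for _, _, n in rows)


def pooled_mean(rows: list[tuple[str, float, int]]) -> float:
    """Mean over all models, reconstructed by weighting each lab average by n."""
    count = total_models(rows)
    if count == 0:
        raise ValueError("no models to average")
    return sum(avg * n for _, avg, n in rows) / count


def unweighted_mean(rows: list[tuple[str, float, int]]) -> float:
    """Mean of the lab averages, treating every lab as equally important."""
    if not rows:
        raise ValueError("no labs to average")
    return sum(avg for _, avg, _ in rows) / len(rows)


if __name__ == "__main__":
    print(total_models(LAB_AVERAGES))
    print(round(pooled_mean(LAB_AVERAGES), 1))
    print(round(unweighted_mean(LAB_AVERAGES), 1))
```

The unweighted figure comes out higher than the pooled one because several of the highest averages belong to labs with only three models each.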

Most Correlated Benchmarks

Benchmark	Pearson r	n
AA Intelligence Index	+0.96	126
HLE	+0.87	126
GPQA	+0.85	126
SciCode	+0.84	126
Terminal Bench	+0.84	17
AA LCR	+0.80	125
SWE-Pro	+0.79	14
LiveCodeBench	+0.78	72
Arena Score	+0.75	102
MMLU-PRO	+0.72	83
IFBench	+0.71	125
MATH-500	+0.60	38
SWE-Verified	+0.60	19
AIME 2026	+0.54	10
HMMT 2026	+0.50	9
Pearson r: −1 to +1. Positive means the two benchmarks rank models in similar order; negative means the opposite.
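
For readers who want to reproduce a coefficient like these, the sketch below computes Pearson r from two lists of paired scores. It is a plain textbook implementation, not the site's own code, and the example scores are made up for illustration. The varying n values suggest that each coefficient uses only models scored on both benchmarks, which the sketch mirrors by requiring equal-length, already-paired inputs.

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient for paired observations."""
    if len(xs) != len(ys):
        raise ValueError("inputs must be paired and of equal length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical scores for four models on two benchmarks.
    hard = [10.0, 20.0, 30.0, 40.0]
    other = [15.0, 25.0, 35.0, 45.0]
    print(pearson_r(hard, other))                  # same ordering
    print(pearson_r(hard, list(reversed(other))))  # opposite ordering
```

Coefficients based on few models, such as the n = 9 and n = 10 rows, carry much more uncertainty than those computed over all 126 models and should be read accordingly.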

What It Captures Well

Discriminates between top-tier agent stacks where regular Terminal Bench saturates.
Measures planning, tool use, and error recovery together.
Scored by a third party using a shared harness.

Where It Falls Short

Scaffold-dependent: model and harness contribute jointly to the score.
Resource-intensive to evaluate.
Linux-only.

Frequently Asked Questions

Should I read this score or the regular Terminal Bench?

For mid-tier models, the regular score gives a better signal. For top-tier models that approach saturation on regular Terminal Bench, the Hard subset is the more useful read.

Related Benchmarks

Based on score correlations across our database.

AA Intelligence Index: Pearson r +0.96
