Terminal Bench

Name: Terminal Bench
Creator: Harbor Framework
Published: 2025
Keywords: Terminal Bench, AI benchmark, text model evaluation, Harbor Framework

A live agent test that drops a model into a real Linux shell and asks it to complete real engineering tasks.

Top Score: 74.7

How It Works

Terminal Bench measures how well a model can drive a real shell. The agent is given a goal — install a package, debug a failing build, set up a service — and a sandboxed Linux environment. Success requires picking the right commands, parsing real output, recovering from errors, and finishing in a bounded number of steps. It is the most operational benchmark in this set, closer to "can this model run my devops" than any other.

Each task has a verifier, usually a script that checks the final filesystem and process state. The agent runs commands, reads stdout, and iterates until it believes the task is complete. The score is the task pass rate: the share of tasks whose verifier passes. Token and step budgets vary by task, and most leaderboards report results inside a fixed agent harness.
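The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the Terminal Bench harness: `run_task`, `pass_rate`, `scripted_policy`, `idle_policy`, and `file_verifier` are hypothetical names, the "model" is a hard-coded script, and commands run in a temporary directory rather than a real sandbox. It assumes a step budget of five commands and a verifier that only inspects the filesystem.

```python
import os
import subprocess
import tempfile

def run_task(policy, verifier, workdir, max_steps=5):
    """Let the policy issue shell commands until it stops or the step budget runs out."""
    history = []  # (command, combined output) pairs the policy can read back
    for _ in range(max_steps):
        command = policy(history)
        if command is None:  # the policy believes the task is complete
            break
        result = subprocess.run(
            command, shell=True, cwd=workdir,
            capture_output=True, text=True, timeout=30,
        )
        history.append((command, result.stdout + result.stderr))
    # The verifier, not the policy, decides whether the task passed.
    return verifier(workdir)

def pass_rate(outcomes):
    """Percentage of tasks whose verifier passed."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical stand-in for a model: replays a fixed list of commands.
def scripted_policy(history):
    script = ["echo hello > out.txt"]
    return script[len(history)] if len(history) < len(script) else None

# Hypothetical stand-in for a model that gives up immediately.
def idle_policy(history):
    return None

# Toy verifier: checks the final filesystem state only.
def file_verifier(workdir):
    path = os.path.join(workdir, "out.txt")
    if not os.path.isfile(path):
        return False
    with open(path) as handle:
        return handle.read().strip() == "hello"

with tempfile.TemporaryDirectory() as d1, tempfile.TemporaryDirectory() as d2:
    outcomes = [
        run_task(scripted_policy, file_verifier, d1),
        run_task(idle_policy, file_verifier, d2),
    ]

print(pass_rate(outcomes))  # 50.0
```

Keeping the verifier separate from the policy means an agent that merely claims success still fails the task, which is the property a pass-rate score depends on.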

Dataset size: A growing suite of tasks that include compiling code, fixing broken builds, navigating filesystems, and using common CLI tools.
Mean score: 49.2
Median score: 51.3
Open / Closed: 18 / 6

Top Scorers

#	Model	Lab	Source	Score
01	Claude Opus 4.6	Anthropic	Closed	74.7
02	DeepSeek-V4-Pro	DeepSeek	Open	67.9
03	Kimi K2.6	Moonshot AI	Open	66.7
04	GPT-5.2	OpenAI	Closed	64.9
05	Gemini 3 Flash	Google	Closed	64.3
06	GLM-5.1	Z.ai	Open	63.5
07	Qwen3.6-27B	Alibaba

Open vs Closed Source

Gap on Terminal Bench: +6.8 points, closed-source leads
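The page does not say how this gap is computed. One reading that reproduces the figure is the best closed-source score minus the best open-source score; the sketch below checks that arithmetic against the complete rows of the Top Scorers table. The best-versus-best definition and the `best` helper are assumptions for illustration.

```python
# Complete rows from the Top Scorers table: (model, source, score).
top_scorers = [
    ("Claude Opus 4.6", "Closed", 74.7),
    ("DeepSeek-V4-Pro", "Open", 67.9),
    ("Kimi K2.6", "Open", 66.7),
    ("GPT-5.2", "Closed", 64.9),
    ("Gemini 3 Flash", "Closed", 64.3),
    ("GLM-5.1", "Open", 63.5),
]

def best(source):
    """Highest score among models with the given source label."""
    return max(score for _, label, score in top_scorers if label == source)

# Assumed definition: best closed score minus best open score.
gap = round(best("Closed") - best("Open"), 1)
print(f"{gap:+.1f} pts")  # +6.8 pts
```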

Top Open-Source Models

1	DeepSeek-V4-Pro	67.9
2	Kimi K2.6	66.7
3	GLM-5.1	63.5

Top Closed-Source Models

1	Claude Opus 4.6	74.7
2	GPT-5.2	64.9
3	Gemini 3 Flash	64.3

Score vs Parameter Count

Six models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average score	n
DeepSeek	54.8	3
Anthropic	53.5	4
Alibaba	49.1	6
Z.ai	43.5	4
Moonshot AI	43.4	4

Most Correlated Benchmarks

Benchmark	Correlation	n
SWE-Pro	+0.82	12
GPQA	+0.78	22
SWE-Verified	+0.76	21
Arena Score	+0.72	20
AIME 2026
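The page does not state which correlation measure is used. A plausible reading is a Pearson correlation over the models that have a score on both benchmarks, with n counting those shared models. The sketch below follows that assumption; `pearson_on_shared` is a hypothetical helper, and the model names and scores are made-up toy values, not leaderboard data.

```python
from math import sqrt

def pearson_on_shared(scores_a, scores_b):
    """Pearson r over models present in both score tables, plus the shared count n."""
    shared = sorted(set(scores_a) & set(scores_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two shared models")
    xs = [scores_a[m] for m in shared]
    ys = [scores_b[m] for m in shared]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one side has no variance")
    return cov / sqrt(var_x * var_y), n

# Toy values for illustration only.
terminal_scores = {"model-a": 40.0, "model-b": 50.0, "model-c": 60.0, "model-d": 70.0}
other_scores = {"model-a": 30.0, "model-b": 35.0, "model-c": 40.0, "model-e": 90.0}

r, n = pearson_on_shared(terminal_scores, other_scores)
print(f"{r:+.2f} n = {n}")  # +1.00 n = 3
```

Because n differs per benchmark pair, a coefficient computed over 12 shared models carries less weight than one computed over 22, which is worth keeping in mind when comparing rows.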

What It Captures Well

Tests grounded, real-world capability rather than text-only reasoning.
Catches models that hallucinate command flags or fail to parse output.
Reflects the kind of work that AI engineers will actually delegate to agents.

Where It Falls Short

Score is dominated by scaffolding and tooling choices, not just the base model.
Linux-only — does not measure Windows or macOS shell competence.
Resource-intensive to run, which slows the cadence of new evaluations.

Frequently Asked Questions

Is Terminal Bench a model benchmark or an agent benchmark?

Both. The score reflects the model plus the scaffold around it. Two leaderboard entries for the same base model can differ by 30+ points depending on tool design, retry policy, and memory.

Why does Terminal Bench correlate with SWE-Verified?

Both reward planning, tool use, and recovery from errors. Models that solve Terminal Bench tasks tend to be good at SWE-Verified style issues, and vice versa. They are not the same test, but they pull on the same underlying agent ability.