A live agent test that drops a model into a real Linux shell and asks it to complete real engineering tasks.
Terminal Bench measures how well a model can drive a real shell. The agent is given a goal — install a package, debug a failing build, set up a service — and a sandboxed Linux environment. Success requires picking the right commands, parsing real output, recovering from errors, and finishing in a bounded number of steps. It is the most operational benchmark in this set, closer to "can this model run my devops" than any other.
Each task ships with a verifier — usually a script that checks the final filesystem and process state. The agent runs commands, reads stdout, and iterates until it believes the task is complete. Scoring is the task pass rate. Token and step budgets vary by task; most leaderboards report results inside a fixed agent harness.
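The command/observe/verify loop described above can be sketched in a few lines. This is a minimal illustration, not the actual Terminal Bench harness: the `verifier`, the `MAX_STEPS` budget, and the `next_command` callback are all hypothetical names invented for this sketch.

```python
import subprocess
from pathlib import Path

MAX_STEPS = 20  # hypothetical per-task step budget


def verifier(workdir: Path) -> bool:
    """Toy verifier: pass if the agent created app.conf with the expected
    setting. Real verifiers are task-specific scripts that inspect the
    final filesystem and process state."""
    conf = workdir / "app.conf"
    return conf.exists() and "port=8080" in conf.read_text()


def run_agent(next_command, workdir: Path) -> bool:
    """Drive the agent loop: ask the model for a shell command, run it in
    the sandbox, feed real stdout/stderr back, and stop when the verifier
    passes or the step budget is exhausted."""
    observation = ""
    for _ in range(MAX_STEPS):
        cmd = next_command(observation)  # model proposes the next command
        result = subprocess.run(
            cmd, shell=True, cwd=workdir, capture_output=True, text=True
        )
        observation = result.stdout + result.stderr  # errors included
        if verifier(workdir):
            return True
    return False
```

A model that issues the right command passes on its first step; one that misreads the output burns through the budget and fails, which is exactly what the pass rate captures.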
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Claude Opus 4.6 | Anthropic | Closed | 74.7 |
| 02 | DeepSeek-V4-Pro | DeepSeek | Open | 67.9 |
| 03 | Kimi K2.6 | Moonshot AI | Open | 66.7 |
| 04 | GPT-5.2 | OpenAI | Closed | 64.9 |
| 05 | Gemini 3 Flash | Google | Closed | 64.3 |
| 06 | GLM-5.1 | Z.ai | Open | 63.5 |
| 07 | Qwen3.6-27B | Alibaba | Open | 59.3 |
| 08 | DeepSeek-V4-Flash | DeepSeek | Open | 56.9 |
| 09 | Claude Sonnet 4.6 | Anthropic | Closed | 53.0 |
| 10 | Qwen3.5-397B-A17B | Alibaba | Open | 52.5 |
| 11 | GLM-5 | Z.ai | Open | 52.4 |
| 12 | Qwen3.6-35B-A3B | Alibaba | Open | 51.5 |
| 13 | Claude Sonnet 4.5 | Anthropic | Closed | 51.0 |
| 14 | Qwen3.5-122B-A10B | Alibaba | Open | 49.4 |
| 15 | Kimi K2.5 | Moonshot AI | Open | 43.2 |

Six models with undisclosed parameter counts are not shown; most closed-source labs do not publish model sizes.

**Does the score measure the model or the agent harness?** Both. The score reflects the model plus the scaffold around it. Two leaderboard entries for the same base model can differ by 30+ points depending on tool design, retry policy, and memory.

**How does Terminal Bench relate to SWE-bench-style coding benchmarks?** Both reward planning, tool use, and recovery from errors. Models that solve Terminal Bench tasks tend to be good at SWE-Verified-style issues, and vice versa. They are not the same test, but they pull on the same underlying agent ability, based on score correlations across our database.