A live agent test that drops a model into a real Linux shell and asks it to complete real engineering tasks.
Terminal Bench measures how well a model can drive a real shell. The agent is given a goal (install a package, debug a failing build, set up a service) and a sandboxed Linux environment. Success requires picking the right commands, parsing real output, recovering from errors, and finishing in a bounded number of steps. It is the most operational benchmark in this set, closer to "can this model run my devops" than any other.
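The bounded-step loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `agent_loop`, `run_shell`, `scripted_policy`, and the `max_steps` parameter are hypothetical names, and the toy policy stands in for a real model call.

```python
import subprocess
from typing import Callable, List, Optional, Tuple

# One transcript entry: (command, combined output, exit code).
Step = Tuple[str, str, int]
# Hypothetical policy type: given the transcript so far, return the next
# command, or None when the agent believes the task is complete.
Policy = Callable[[List[Step]], Optional[str]]


def run_shell(cmd: str, cwd: str) -> Tuple[str, int]:
    """Run one command in the sandbox directory; return (output, exit code)."""
    proc = subprocess.run(
        cmd, shell=True, cwd=cwd, capture_output=True, text=True, timeout=30
    )
    return proc.stdout + proc.stderr, proc.returncode


def agent_loop(
    policy: Policy,
    cwd: str,
    max_steps: int = 10,
    execute: Callable[[str, str], Tuple[str, int]] = run_shell,
) -> List[Step]:
    """Ask the policy for commands until it stops or the step budget runs out."""
    transcript: List[Step] = []
    for _ in range(max_steps):
        cmd = policy(transcript)
        if cmd is None:  # the agent declares the task finished
            break
        output, code = execute(cmd, cwd)
        transcript.append((cmd, output, code))
    return transcript


def scripted_policy(transcript: List[Step]) -> Optional[str]:
    """Toy stand-in for a model: recover from a failed read by creating the file."""
    if not transcript:
        return "cat notes.txt"
    last_cmd, _, last_code = transcript[-1]
    if last_cmd == "cat notes.txt" and last_code != 0:
        return "echo hello > notes.txt"  # error recovery step
    if last_cmd.startswith("echo"):
        return "cat notes.txt"  # retry the original command
    return None
```

The `execute` argument is injectable so the loop can be exercised against a fake shell; in a real run it would point at an isolated container rather than the host.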
Each task has a verifier, usually a script that checks the final filesystem and process state. The agent runs commands, reads stdout, and iterates until it believes the task is complete. The score is the task pass rate: the share of tasks whose verifier passes. Token and step budgets vary by task; most leaderboards report results inside a fixed agent harness.
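As a rough sketch of how a verifier and the pass-rate score fit together, assuming a hypothetical task whose goal is to write `port=8080` into `app.conf`: the function names and the task itself are illustrative, and a real verifier would typically check process state as well as files.

```python
import os
from typing import Dict


def verify_config_written(sandbox: str) -> bool:
    """Hypothetical verifier: pass only if app.conf exists and sets port=8080."""
    path = os.path.join(sandbox, "app.conf")
    if not os.path.isfile(path):
        return False
    with open(path) as f:
        return any(line.strip() == "port=8080" for line in f)


def pass_rate(results: Dict[str, bool]) -> float:
    """Percentage of tasks whose verifier passed (0.0 for an empty run)."""
    if not results:
        return 0.0
    return 100.0 * sum(results.values()) / len(results)
```

The verifier inspects only the final state of the sandbox, so an agent gets no credit for a plausible-looking transcript that leaves the environment unchanged.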
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | DeepSeek-V4-Pro | DeepSeek | Open | 67.9 |
| 02 | Kimi K2.6 | Moonshot AI | Open | 66.7 |
| 03 | GLM-5.1 | Z.ai | Open | 63.5 |
| 04 | Qwen3.6-27B | Alibaba | Open | 59.3 |
| 05 | DeepSeek-V4-Flash | DeepSeek | Open | 56.9 |
| 06 | Qwen3.5-397B-A17B | Alibaba | Open | 52.5 |
| 07 | GLM-5 | Z.ai | Open | 52.4 |
| 08 | Qwen3.6 35B-A3B | Alibaba | Open | 51.5 |
| 09 | Qwen3.5-122B-A10B | Alibaba | Open | 49.4 |
| 10 | Kimi K2.5 | Moonshot AI | Open | 43.2 |
| 11 | Qwen3.5-27B | Alibaba | Open | 41.6 |
| 12 | Kimi K2 Thinking | Moonshot AI | Open | 35.7 |
| 13 | GLM-4.7 | Z.ai | Open | 33.4 |
| 14 | Nvidia Nemotron 3 Super | NVIDIA | Open | 31.0 |
| 15 | Kimi K2 Instruct | Moonshot AI | Open | 27.8 |
The score reflects both the model and the scaffold around it. Two leaderboard entries for the same base model can differ by 30+ points depending on tool design, retry policy, and memory.
Terminal Bench and SWE-Verified both reward planning, tool use, and recovery from errors. Models that solve Terminal Bench tasks tend to be good at SWE-Verified style issues, and vice versa. They are not the same test, but they pull on the same underlying agent ability. This observation is based on score correlations across our database.