The harder tier of Terminal Bench, scored by Artificial Analysis as an agent stress test.
Terminal Bench Hard is the harder slice of Terminal Bench. Tasks demand more planning, deeper error recovery, and longer chains of tool calls in a real Linux sandbox. It is a stress test for agent harnesses and the closest public proxy for "can this model handle a multi-hour DevOps task?"
The setup is the same as Terminal Bench: each task has a verifier script that checks the final filesystem state. The Hard subset uses tighter budgets and harder problems. The score is the task pass rate.
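
As a rough illustration of this scoring scheme, the sketch below shows what a filesystem-state verifier and a pass-rate calculation could look like. It is not the benchmark's actual harness. The names `verify_config_written` and `pass_rate`, the example task (writing `port=8080` to `app/config.ini`), and the choice to report the rate as a percentage rounded to one decimal are all assumptions made for illustration.

```python
from __future__ import annotations

from pathlib import Path


def verify_config_written(sandbox_root: Path) -> bool:
    """Hypothetical verifier: pass if the agent left the expected file behind.

    Only the final filesystem state is inspected; how the agent got there
    (which commands it ran, how many retries it needed) is not checked.
    """
    target = sandbox_root / "app" / "config.ini"
    return target.is_file() and "port=8080" in target.read_text()


def pass_rate(results: list[bool]) -> float:
    """Share of tasks whose verifier passed, as a percentage.

    Rounding to one decimal is an assumption, not a documented rule.
    """
    if not results:
        raise ValueError("no task results to score")
    return round(100 * sum(results) / len(results), 1)
```

Under these assumptions, a run in which three of four verifiers pass would score `pass_rate([True, False, True, True])`, which evaluates to `75.0`.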
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | GPT-5.5 | OpenAI | Closed | 60.6 |
| 02 | GPT-5.4 | OpenAI | Closed | 57.6 |
| 03 | Gemini 3.1 Pro Preview | Google | Closed | 53.8 |
| 04 | Claude Opus 4.7 | Anthropic | Closed | 51.5 |
| 05 | Claude Opus 4.6 | Anthropic | Closed | 48.5 |
| 06 | GPT-5.2 | OpenAI | Closed | 47.0 |
| 07 | DeepSeek-V4-Pro | DeepSeek | Open | 46.2 |
| 08 | Claude Sonnet 4.6 | Anthropic | Closed | 46.2 |
| 09 | GPT-5.1 | OpenAI | Closed | 45.5 |
| 10 | Qwen3.6 Max Preview | Alibaba | Closed | 43.9 |
| 11 | Kimi K2.6 | Moonshot AI | Open | 43.9 |
| 12 | Qwen3.6-Plus | Alibaba | Closed | 43.9 |
| 13 | MiMo-V2.5-Pro | Xiaomi | Closed | 43.2 |
| 14 | GLM-5 | Z.ai | Open | 43.2 |
| 15 | GLM-5.1 | Z.ai | Open | 43.2 |

30 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

For mid-tier models, the regular Terminal Bench score gives a better signal. For top-tier models that approach saturation on regular Terminal Bench, the Hard subset is the more useful read.