Long-horizon, enterprise-style coding tasks that take human engineers hours, not minutes.
SWE-Pro raises the bar past SWE-Verified. The tasks are larger, the codebases are bigger, the changes span multiple files, and the test suites are deeper. A successful run looks like a full pull request rather than a small patch. SWE-Pro measures whether a model and its agent harness can act like a junior engineer working on a feature for an afternoon.
Tasks ship with a repository snapshot, a description, and a hidden test suite. The agent makes changes, runs tests, and iterates. Scoring is the fraction of tasks that pass all hidden tests within a fixed step or token budget.
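
The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `TaskRun` record, its field names, the step-based budget, and the choice to report the fraction as a percentage rounded to one decimal are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Outcome of one agent attempt at one task (hypothetical record layout)."""
    task_id: str
    hidden_tests_passed: int
    hidden_tests_total: int
    steps_used: int


def is_resolved(run: TaskRun, step_budget: int) -> bool:
    """A task counts only if every hidden test passes within the budget."""
    within_budget = run.steps_used <= step_budget
    all_passed = (
        run.hidden_tests_total > 0
        and run.hidden_tests_passed == run.hidden_tests_total
    )
    return within_budget and all_passed


def benchmark_score(runs: list[TaskRun], step_budget: int) -> float:
    """Percentage of tasks resolved, rounded to one decimal (assumed format)."""
    if not runs:
        return 0.0
    resolved = sum(is_resolved(r, step_budget) for r in runs)
    return round(100.0 * resolved / len(runs), 1)


runs = [
    TaskRun("task-a", 12, 12, 140),  # all hidden tests pass, within budget
    TaskRun("task-b", 11, 12, 90),   # one hidden test fails
    TaskRun("task-c", 8, 8, 260),    # all pass, but over the step budget
    TaskRun("task-d", 5, 5, 200),    # all pass, exactly at the budget
]
print(benchmark_score(runs, step_budget=200))  # 50.0
```

Partial credit is deliberately absent in this sketch: `task-b` passes 11 of 12 hidden tests and still counts as a failure, which matches the all-tests-must-pass rule described above.
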
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | Kimi K2.6 | Moonshot AI | Open | 58.6 |
| 02 | GLM-5.1 | Z.ai | Open | 58.4 |
| 03 | DeepSeek-V4-Pro | DeepSeek | Open | 55.4 |
| 04 | minimax-m2.5 | MiniMax | Open | 55.4 |
| 05 | Qwen3.6-27B | Alibaba | Open | 53.5 |
| 06 | Kimi K2.5 | Moonshot AI | Open | 50.7 |
| 07 | Qwen3.6 35B-A3B | Alibaba | Open | 49.5 |
| 08 | Kimi K2 Instruct | Moonshot AI | Open | 27.7 |
| 09 | Qwen3-235B-A22B | Alibaba | Open | 21.4 |
| 10 | GLM-4.6 | Z.ai | Open | 9.7 |

Compared with SWE-Verified, the tasks are longer and the codebases are bigger. A SWE-Verified issue typically needs a 5–50 line patch. SWE-Pro tasks routinely need hundreds of lines across several files, with careful test wiring.
SWE-Pro is a useful signal for senior-engineer-level autonomy, but not for day-to-day code completion. Pair it with a fast feedback benchmark like HumanEval++ or a real evaluation in your own repo.