Long-horizon, enterprise-style coding tasks that take human engineers hours, not minutes.
SWE-Pro raises the bar past SWE-Verified. The tasks are larger, the codebases are bigger, the changes span multiple files, and the test suites are deeper. A successful run looks like a full pull request rather than a small patch. SWE-Pro measures whether a model and its agent harness can act like a junior engineer working on a feature for an afternoon.
Tasks ship with a repository snapshot, a description, and a hidden test suite. The agent makes changes, runs tests, and iterates. Scoring is the fraction of tasks that pass all hidden tests within a fixed step or token budget.
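To make the protocol concrete, here is a minimal sketch of that loop. The `Agent` and `Sandbox` interfaces, the method names, and the 50-step budget are illustrative assumptions, not the actual SWE-Pro harness.

```python
from dataclasses import dataclass

@dataclass
class Task:
    repo_snapshot: str  # path to the repository checkout
    description: str    # natural-language statement of the change to make
    # the hidden test suite lives inside the sandbox and is never shown to the agent

def run_task(agent, sandbox, task, max_steps: int = 50) -> bool:
    """One episode: the agent edits files, runs tests, and iterates
    until it submits or exhausts the fixed step budget."""
    sandbox.load(task.repo_snapshot)
    for _ in range(max_steps):
        action = agent.next_action(task.description, sandbox.observation())
        if action.kind == "submit":
            break
        sandbox.apply(action)  # e.g. edit a file or run the visible tests
    # a task counts as solved only if every hidden test passes
    return sandbox.run_hidden_tests().all_passed

def score(agent, make_sandbox, tasks) -> float:
    """Benchmark score: the fraction of tasks whose hidden suite fully passes."""
    solved = sum(run_task(agent, make_sandbox(), t) for t in tasks)
    return solved / len(tasks)
```

Under this scheme a reported score of 55.4 means the agent fully passed the hidden suite on 55.4% of tasks; per-task credit is all or nothing.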
| # | Model | Lab | Source | Score (%) |
|---|---|---|---|---|
| 01 | Gemini 3 Flash | Google | Closed | 71.2 |
| 02 | Kimi K2.6 | Moonshot AI | Open | 58.6 |
| 03 | GLM-5.1 | Z.ai | Open | 58.4 |
| 04 | GPT-5.4 | OpenAI | Closed | 57.7 |
| 05 | DeepSeek-V4-Pro | DeepSeek | Open | 55.4 |
| 06 | minimax-m2.5 | MiniMax | Open | 55.4 |
| 07 | Qwen3.6-27B | Alibaba | Open | 53.5 |
| 08 | Kimi K2.5 | Moonshot AI | Open | 50.7 |
| 09 | Qwen3.6 35B-A3B | Alibaba | Open | 49.5 |
| 10 | Claude Opus 4.6 | Anthropic | Closed | 45.0 |
| 11 | Kimi K2 Instruct | Moonshot AI | Open | 27.7 |
| 12 | Qwen3-235B-A22B | Alibaba | Open | 21.4 |
| 13 | DeepSeek-V3.2 | DeepSeek | Open | 15.6 |
| 14 | Claude Haiku 4.5 | Anthropic | Closed | 14.0 |
| 15 | Gemma 3 27B IT | Google | Open | 11.4 |
Four models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.
The tasks are longer and the codebases are bigger. A SWE-Verified issue typically needs a 5–50 line patch. SWE-Pro tasks routinely need hundreds of lines across several files, with careful test wiring.
For junior-engineer-level autonomy, yes. For day-to-day code completion, no; pair it with a fast-feedback benchmark like HumanEval++ or a real evaluation in your own repo.