Five hundred real GitHub issues, hand-checked by engineers, that test whether a model can ship a working code change.
SWE-Verified is the closest thing the field has to a real-world software engineering test. Each task gives the model a repository, an open issue, and the project test suite. The model has to produce a code patch that resolves the issue and passes the hidden tests. Humans verified that each issue is well-specified and solvable, so a failure points at the model, not at a broken benchmark.
The agent is given the repo, the issue, and a sandboxed shell. It can read files, run commands, and edit code. The resulting patch is scored pass or fail against the project's test suite, including hidden regression tests written by the original maintainers. The final score is the fraction of issues resolved correctly.
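The evaluation protocol above can be sketched in a few lines. This is an illustrative outline, not the real harness: `Task`, `resolved`, and `score` are hypothetical names, and a production harness would isolate each run in a container.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Task:
    repo_dir: str        # checkout of the repository at the issue's base commit
    test_cmd: list[str]  # e.g. ["pytest", "tests/"]

def resolved(task: Task, patch: str) -> bool:
    """Apply the agent's patch, then run the project's test suite
    (including the hidden regression tests). Pass/fail, nothing partial."""
    applied = subprocess.run(["git", "apply", "-"], input=patch,
                             text=True, cwd=task.repo_dir)
    if applied.returncode != 0:   # a malformed patch counts as a failure
        return False
    tests = subprocess.run(task.test_cmd, cwd=task.repo_dir)
    return tests.returncode == 0

def score(resolved_flags: list[bool]) -> float:
    """The leaderboard number: fraction of issues resolved (x100)."""
    return sum(resolved_flags) / len(resolved_flags)
```

Note that a patch that fails to apply scores the same as a patch that breaks the tests: the metric only counts fully resolved issues.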
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Claude Opus 4.6 | Anthropic | Closed | 80.8 |
| 02 | DeepSeek-V4-Pro | DeepSeek | Open | 80.6 |
| 03 | Kimi K2.6 | Moonshot AI | Open | 80.2 |
| 04 | GPT-5.2 | OpenAI | Closed | 80.0 |
| 05 | Claude Sonnet 4.6 | Anthropic | Closed | 79.6 |
| 06 | DeepSeek-V4-Flash | DeepSeek | Open | 79.0 |
| 07 | Qwen3.6-27B | Alibaba | Open | 77.2 |
| 08 | Claude Sonnet 4.5 | Anthropic | Closed | 77.2 |
| 09 | Qwen3.5-397B-A17B | Alibaba | Open | 76.4 |
| 10 | Gemini 3 Pro | Google | Closed | 76.2 |
| 11 | minimax-m2.5 | MiniMax | Open | 75.8 |
| 12 | GPT-5.1 | OpenAI | Closed | 74.9 |
| 13 | GLM-4.7 | Z.ai | Open | 73.8 |
| 14 | Qwen3.6 35B-A3B | Alibaba | Open | 73.4 |
| 15 | GLM-5 | Z.ai | Open | 72.8 |

8 models with undisclosed parameter counts are not shown; most closed-source labs do not publish model size.

Frontier closed models with strong agent scaffolds score above 70% in 2026. Strong open-weight models in similar scaffolds are now within a point or two of the frontier (DeepSeek-V4-Pro at 80.6 vs. Claude Opus 4.6 at 80.8). Models below 20% are not yet useful as autonomous coding agents on this kind of task.

The score reflects both the model and its scaffold. SWE-Verified is famously sensitive to scaffolding: the same base model can swing 20 points depending on the agent loop, tools, and retry strategy. Compare scores within the same scaffold whenever possible.

SWE-Verified consists of human-curated open-source Python issues. SWE-Pro consists of longer, harder, enterprise-style tasks that demand more planning; for the same model, SWE-Pro scores are generally much lower than SWE-Verified scores.
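Scaffold sensitivity is easier to see in code. The sketch below is a minimal, hypothetical agent loop: `call_model` stands in for any chat-completions API, and the step budget, tool set, and retry policy are all scaffold choices rather than model choices, which is why the same base model can score very differently across harnesses.

```python
# Minimal agent-scaffold sketch (illustrative; not any particular lab's harness).
# Everything here except the model itself -- the loop, the tools, the retry
# policy -- belongs to the scaffold.

def run_agent(call_model, tools, issue, max_steps=30, max_retries=2):
    """Drive a model through tool calls until it emits a patch.

    call_model(history) returns either {"tool": name, "args": {...}}
    or {"patch": diff_text} when the model believes it is done.
    """
    for attempt in range(max_retries + 1):
        history = [{"role": "user", "content": issue}]
        for _ in range(max_steps):
            action = call_model(history)
            if "patch" in action:            # model is done: hand patch to grader
                return action["patch"], attempt
            observation = tools[action["tool"]](**action["args"])
            history.append({"role": "tool", "content": observation})
        # Step budget exhausted: retry from scratch with a fresh context.
    return None, max_retries                 # gave up; scored as a failure
```

Doubling `max_steps`, adding a better file-search tool, or retrying on test failure all change the score without touching the model, which is why cross-scaffold comparisons are unreliable.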