Five hundred real GitHub issues, hand-checked by engineers, that test whether a model can ship a working code change.
SWE-Verified is the closest thing the field has to a real-world software engineering test. Each task gives the model a repository, an open issue, and the project test suite. The model has to produce a code patch that resolves the issue and passes the hidden tests. Humans verified that each issue is well-specified and solvable, so a failure points at the model, not at a broken benchmark.
A model agent is given the repo, the issue, and a sandboxed shell. It can read files, run commands, and edit code. The patch is scored pass or fail against the project test suite, including hidden regression tests written by the original maintainers. The score is the percentage of issues resolved correctly.
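The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` record, the `resolve_rate` function, and the sample outcomes are all hypothetical names and made-up values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Outcome of one benchmark task (hypothetical record shape)."""
    task_id: str
    tests_passed: bool  # True only if the patch passes the whole test suite


def resolve_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks whose patch passed, from 0 to 100."""
    if not results:
        raise ValueError("no results to score")
    resolved = sum(1 for r in results if r.tests_passed)
    return 100.0 * resolved / len(results)


# Illustrative, made-up outcomes: 3 of 4 tasks resolved.
example = [
    TaskResult("task-1", True),
    TaskResult("task-2", False),
    TaskResult("task-3", True),
    TaskResult("task-4", True),
]
print(f"{resolve_rate(example):.1f}")  # prints 75.0
```

Because each task is strictly pass or fail, there is no partial credit. On a 500-task set, a score of 80.6 corresponds to 403 resolved issues.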
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | DeepSeek-V4-Pro | DeepSeek | Open | 80.6 |
| 02 | Kimi K2.6 | Moonshot AI | Open | 80.2 |
| 03 | DeepSeek-V4-Flash | DeepSeek | Open | 79.0 |
| 04 | Mistral Medium 3.5 | Mistral AI | Open | 77.6 |
| 05 | Qwen3.6-27B | Alibaba | Open | 77.2 |
| 06 | Qwen3.5-397B-A17B | Alibaba | Open | 76.4 |
| 07 | minimax-m2.5 | MiniMax | Open | 75.8 |
| 08 | Ring-2.6-1T | inclusionAI | Open | 74.0 |
| 09 | GLM-4.7 | Z.ai | Open | 73.8 |
| 10 | Qwen3.6 35B-A3B | Alibaba | Open | 73.4 |
| 11 | GLM-5 | Z.ai | Open | 72.8 |
| 12 | Qwen3.5-27B | Alibaba | Open | 72.4 |
| 13 | Qwen3.5-122B-A10B | Alibaba | Open | 72.0 |
| 14 | Kimi K2 Thinking | Moonshot AI | Open | 71.3 |
| 15 | Kimi K2.5 | Moonshot AI | Open | 70.8 |
Frontier closed models with strong agent scaffolds are above 70% in 2026. Strong open-weight models in similar scaffolds are in the 40–60% range. Models below 20% are not yet useful as autonomous coding agents on this kind of task.
The score reflects both the model and the scaffold around it. SWE-Verified is famously sensitive to scaffolding: the same base model can swing 20 points depending on the agent loop, tools, and retry strategy. Compare scores within the same scaffold whenever possible.
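One way to follow the same-scaffold advice is to group results by scaffold before ranking. The sketch below assumes each result is a `(model, scaffold, score)` tuple. The function name, the placeholder model and scaffold names, and the scores are all invented for illustration and are not leaderboard data.

```python
from collections import defaultdict


def rank_within_scaffold(rows):
    """Group (model, scaffold, score) rows by scaffold, best score first."""
    groups = defaultdict(list)
    for model, scaffold, score in rows:
        groups[scaffold].append((model, score))
    return {
        scaffold: sorted(entries, key=lambda pair: pair[1], reverse=True)
        for scaffold, entries in groups.items()
    }


# Made-up numbers: the ranking flips depending on the scaffold.
rows = [
    ("model-a", "scaffold-x", 62.0),
    ("model-b", "scaffold-x", 58.0),
    ("model-a", "scaffold-y", 45.0),
    ("model-b", "scaffold-y", 49.0),
]

for scaffold, ranking in rank_within_scaffold(rows).items():
    print(scaffold, ranking)
```

In this invented data, `model-a` leads under `scaffold-x` and trails under `scaffold-y`, which is why a single cross-scaffold ranking can mislead.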
SWE-Verified consists of human-curated issues from open-source Python projects. SWE-Pro consists of longer, harder, enterprise-style tasks that demand more planning. For the same model, SWE-Pro scores are generally much lower than SWE-Verified scores.