Real GitHub issues an agent has to fix by editing a codebase until the project test suite passes.
SWE-bench Verified gives an agent a real bug report from a popular open-source Python project and the repository it lives in. The agent has to find the right files, write a patch, and make the hidden tests pass. It is the closest public test to "can this agent do a junior engineer's ticket end to end."
Each task ships with the repository state, the issue text, and a hidden set of tests. A run counts as resolved only if the agent's patch makes the failing tests pass without breaking the passing ones. Scores are the percentage of the 500 tasks resolved.
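The resolution rule and the headline score can be sketched in a few lines. This is a minimal illustration, not the official evaluation harness: the `TaskResult` record and the `is_resolved` and `score` helpers are hypothetical names, and the field names `fail_to_pass` and `pass_to_pass` are assumed labels for the two groups of hidden tests described above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record: test name -> did it pass after the patch?"""
    fail_to_pass: dict[str, bool]  # tests that failed before the patch
    pass_to_pass: dict[str, bool]  # tests that already passed before the patch


def is_resolved(result: TaskResult) -> bool:
    # A task needs at least one target test; an empty set is not a success.
    if not result.fail_to_pass:
        return False
    fixed = all(result.fail_to_pass.values())        # every failing test now passes
    unbroken = all(result.pass_to_pass.values())     # nothing that passed was broken
    return fixed and unbroken


def score(results: list[TaskResult], total_tasks: int = 500) -> float:
    # Percentage of all tasks resolved; tasks with no result count as unresolved.
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / total_tasks
```

Dividing by the full task count rather than by the number of attempted tasks means a system that skips or crashes on a task is scored as if it failed it.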
Every system is evaluated on the same tasks, but the score depends heavily on the agent scaffold around the model: how it searches the repo, runs tests, and retries. Two systems built on the same model can land 20 points apart, which is exactly the gap this leaderboard surfaces.
Top agent systems in 2026 resolve roughly 65–75% of tasks. A year earlier the best were near 50%, so this number moves fast.