A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.
SWE-bench Verified Mini is a curated 50-task subset of SWE-bench Verified. It tracks the full benchmark closely while costing a fraction as much to run, which makes it a practical screen for coding agents before committing to the full 500-task evaluation.
Scoring is identical to SWE-bench Verified: a task counts as resolved only when the agent's patch passes the hidden tests. Each score is the percentage of the 50 tasks resolved.
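The scoring rule above reduces to a simple percentage. Here is a minimal sketch, assuming each task outcome is already available as a boolean (resolved or not). The function name `resolved_percentage` and the example outcomes are hypothetical and not part of any SWE-bench tooling.

```python
from typing import Iterable


def resolved_percentage(outcomes: Iterable[bool]) -> float:
    """Return the share of resolved tasks as a percentage.

    `outcomes` holds one boolean per task: True if the agent's patch
    passed the hidden tests for that task, False otherwise.
    """
    results = list(outcomes)
    if not results:
        raise ValueError("at least one task outcome is required")
    return 100.0 * sum(results) / len(results)


# Hypothetical run: 36 of the 50 Mini tasks resolved.
example_outcomes = [True] * 36 + [False] * 14
print(resolved_percentage(example_outcomes))  # 72.0
```

On a single 50-task run, each additional resolved task moves the result by 2 percentage points.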
| # | Agent System | Model | Score (%) |
|---|---|---|---|
| 01 | SWE-Agent | Claude Sonnet 4.5 High (September 2025) | 72.0 |
| 02 | SWE-Agent | Claude Sonnet 4.5 (September 2025) | 68.0 |
| 03 | SWE-Agent | Claude Opus 4.1 (August 2025) | 61.0 |
| 04 | SWE-Agent | Claude Opus 4.1 High (August 2025) | 54.0 |
| 05 | SWE-Agent | Claude-3.7 Sonnet High (February 2025) | 54.0 |
| 06 | SWE-Agent | o4-mini Low (April 2025) | 54.0 |
| 07 | SWE-Agent | Claude Opus 4 (May 2025) | 50.0 |
| 08 | SWE-Agent | Claude-3.7 Sonnet (February 2025) | 50.0 |
| 09 | SWE-Agent | o4-mini High (April 2025) | 50.0 |
| 10 | SWE-Agent | o3 Medium (April 2025) | 46.0 |
| 11 | HAL Generalist Agent | Claude Opus 4.1 High (August 2025) | 46.0 |
| 12 | SWE-Agent | GPT-5 Medium (August 2025) | 46.0 |
| 13 | SWE-Agent | GPT-4.1 (April 2025) | 44.0 |
| 14 | HAL Generalist Agent | Claude Haiku 4.5 High (October 2025) | 44.0 |
| 15 | HAL Generalist Agent | Claude Opus 4.1 (August 2025) | 42.0 |
| 16 | HAL Generalist Agent | Claude Sonnet 4.5 High (September 2025) | 40.0 |
| 17 | HAL Generalist Agent | Claude Opus 4 (May 2025) | 34.0 |
| 18 | HAL Generalist Agent | Claude Sonnet 4.5 (September 2025) | 34.0 |
| 19 | HAL Generalist Agent | Claude Opus 4 High (May 2025) | 30.0 |
| 20 | HAL Generalist Agent | Claude-3.7 Sonnet (February 2025) | 26.0 |
| 21 | HAL Generalist Agent | Claude Haiku 4.5 (October 2025) | 24.0 |
| 22 | HAL Generalist Agent | Claude-3.7 Sonnet High (February 2025) | 24.0 |
| 23 | SWE-Agent | DeepSeek V3 (March 2025) | 24.0 |
| 24 | SWE-Agent | Gemini 2.0 Flash (February 2025) | 24.0 |
| 25 | HAL Generalist Agent | GPT-5 Medium (August 2025) | 12.0 |
Treat Mini as a fast screen, not a final verdict. It correlates well with the full set, but with only 50 tasks its scores are noisier, so small gaps between agents may not hold up on the full benchmark. For a published number, use the full 500-task SWE-bench Verified.
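To get a rough sense of how much noisier 50 tasks are than 500, the sketch below computes an approximate 95% margin of error for a score. It treats each task as an independent pass/fail trial and uses the normal approximation to the binomial. Both are simplifying assumptions, since Mini is a curated subset rather than a random sample, and `score_margin` is a hypothetical helper name.

```python
import math


def score_margin(score_pct: float, n_tasks: int, z: float = 1.96) -> float:
    """Approximate margin of error, in percentage points, for a score.

    Assumes independent pass/fail tasks and a normal approximation
    to the binomial; z = 1.96 corresponds to roughly 95% coverage.
    """
    p = score_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_tasks)


# Compare the Mini subset (50 tasks) with the full set (500 tasks)
# for an agent scoring 50%.
for n in (50, 500):
    print(n, round(score_margin(50.0, n), 1))
```

Under these assumptions a 50% score carries a margin of about ±13.9 points on 50 tasks and about ±4.4 points on 500, which is why a gap of a few points on Mini is weak evidence on its own.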