Agent Benchmark · 2024

SWE-bench Verified Mini

Name: SWE-bench Verified Mini
Creator: Princeton (HAL)
Published: 2024
Keywords: SWE-bench Verified Mini, AI agent benchmark, Princeton (HAL)

A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

Systems Ranked

Top Score

72.0

Published

2024

Source

Princeton (HAL)

How It Works

SWE-bench Verified Mini is a curated 50-task subset of SWE-bench Verified. It tracks the full benchmark closely while costing a fraction as much to run, which makes it a practical screen for coding agents before committing to the full 500-task evaluation.

Scoring is identical to SWE-bench Verified: a task counts as resolved only when the agent's patch passes the hidden tests. The score is the percentage of the 50 tasks resolved, so each task is worth 2 percentage points.
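The scoring rule can be sketched in a few lines. This is a minimal illustration, not the HAL harness: the `resolved_rate` function, the task ids, and the `task_id -> resolved` mapping are hypothetical stand-ins for whatever per-task output an evaluation run produces.

```python
def resolved_rate(results: dict[str, bool]) -> float:
    """Return the percentage of tasks resolved.

    `results` maps a task id to True when the agent's patch passed
    the hidden tests for that task (hypothetical input format).
    """
    if not results:
        raise ValueError("no task results to score")
    return 100.0 * sum(results.values()) / len(results)


# Hypothetical run: 36 of the 50 tasks resolved.
results = {f"task-{i:02d}": i < 36 for i in range(50)}
print(resolved_rate(results))  # 72.0
```

There is no partial credit in this sketch: a task either passes the hidden tests or it does not, which matches the pass/fail rule described above.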

Dataset size

A 50-task subset of SWE-bench Verified.

Agent type

Coding

Top Agent Systems

#	Agent System	Model	Score
01	SWE-Agent	Claude Sonnet 4.5 High (September 2025)	72.0
02	SWE-Agent	Claude Sonnet 4.5 (September 2025)	68.0
03	SWE-Agent	Claude Opus 4.1 (August 2025)	61.0
04	SWE-Agent	Claude Opus 4.1 High (August 2025)	54.0
05	SWE-Agent	Claude-3.7 Sonnet High (February 2025)	54.0
06	SWE-Agent	o4-mini Low (April 2025)	54.0
07	SWE-Agent	Claude Opus 4 (May 2025)	50.0
08	SWE-Agent	Claude-3.7 Sonnet (February 2025)	50.0
09	SWE-Agent	o4-mini High (April 2025)	50.0
10	SWE-Agent	o3 Medium (April 2025)	46.0
11	HAL Generalist Agent	Claude Opus 4.1 High (August 2025)	46.0
12	SWE-Agent	GPT-5 Medium (August 2025)	46.0
13	SWE-Agent	GPT-4.1 (April 2025)	44.0
14	HAL Generalist Agent	Claude Haiku 4.5 High (October 2025)	44.0
15	HAL Generalist Agent	Claude Opus 4.1 (August 2025)	42.0
16	HAL Generalist Agent	Claude Sonnet 4.5 High (September 2025)	40.0
17	HAL Generalist Agent	Claude Sonnet 4.5 (September 2025)	34.0
18	HAL Generalist Agent	Claude Opus 4 (May 2025)	34.0
19	HAL Generalist Agent	Claude Opus 4 High (May 2025)	30.0
20	HAL Generalist Agent	Claude-3.7 Sonnet (February 2025)	26.0
21	HAL Generalist Agent	Claude Haiku 4.5 (October 2025)	24.0
22	HAL Generalist Agent	Claude-3.7 Sonnet High (February 2025)	24.0
23	SWE-Agent	DeepSeek V3 (March 2025)	24.0
24	SWE-Agent	Gemini 2.0 Flash (February 2025)	24.0
25	HAL Generalist Agent	GPT-5 Medium (August 2025)	12.0

Strengths

Cheap enough to run often, so it suits rapid iteration.
Same pass/fail rigor as the full benchmark.
Good for spotting regressions before a full run.
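One way to use Mini for regression spotting is to compare per-task outcomes between two runs rather than only the headline score. The sketch below assumes each run is available as a hypothetical `task_id -> resolved` mapping; the function name and task ids are illustrative only.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """List task ids resolved in the baseline run but not in the candidate run.

    A task missing from the candidate run is treated as unresolved.
    """
    return sorted(
        task_id
        for task_id, resolved in baseline.items()
        if resolved and not candidate.get(task_id, False)
    )


# Hypothetical per-task results from two agent versions.
baseline = {"task-a": True, "task-b": True, "task-c": False}
candidate = {"task-a": True, "task-b": False, "task-c": True}
print(find_regressions(baseline, candidate))  # ['task-b']
```

In this example both runs resolve two of three tasks, so the aggregate score is unchanged even though `task-b` regressed. A per-task diff surfaces that.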

Limitations

With only 50 tasks, each task is worth 2 percentage points, so a single failure moves the score noticeably.
Not a substitute for the full benchmark on a final report.
Inherits the Python-only and library-skew limits of the parent set.
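To put a rough number on the small-sample limitation, a score can be treated as a binomial proportion and given a Wilson score interval. This is a sketch under assumptions that do not hold exactly (tasks are treated as independent draws, and only task-sampling uncertainty is modeled). The function name is illustrative and not part of any benchmark tooling.

```python
import math


def wilson_interval(resolved: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a resolved rate, in percent."""
    p = resolved / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return 100 * (center - half_width), 100 * (center + half_width)


# Same 72% resolved rate on 50 tasks versus 500 tasks.
low_mini, high_mini = wilson_interval(36, 50)
low_full, high_full = wilson_interval(360, 500)
print(f"50 tasks:  {low_mini:.1f} to {high_mini:.1f}")
print(f"500 tasks: {low_full:.1f} to {high_full:.1f}")
```

Under these assumptions, a 72% score on 50 tasks is compatible with an underlying rate of roughly 58% to 83%, while the same rate on 500 tasks narrows to roughly 68% to 76%.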

Frequently Asked Questions

Should I trust Mini scores as much as the full benchmark?

Treat Mini as a fast screen, not a final verdict. Its scores track the full set closely but are noisier because the sample is ten times smaller. For a published number, use the full 500-task SWE-bench Verified.
