Benchmarks · 2025

SWE-Pro: SWE-bench Pro

Name: SWE-Pro: SWE-bench Pro
Creator: Scale AI and Princeton
Published: 2025
Keywords: SWE-Pro, AI benchmark, text model evaluation, Scale AI and Princeton

Long-horizon, enterprise-style coding tasks that take human engineers hours, not minutes.

Open Dataset

Models Tested: 19 (15 open, 4 closed)

Top Score: 71.2
Published: 2025
Source: Scale AI and Princeton

How It Works

SWE-Pro raises the bar past SWE-Verified. The tasks are larger, the codebases are bigger, the changes span multiple files, and the test suites are deeper. A successful run looks like a full pull request rather than a small patch. SWE-Pro measures whether a model and its agent harness can act like a junior engineer working on a feature for an afternoon.

Tasks ship with a repository snapshot, a description, and a hidden test suite. The agent makes changes, runs tests, and iterates. Scoring is the fraction of tasks that pass all hidden tests within a fixed step or token budget.
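The scoring rule above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` record, its field names, the step-based budget, and the 0-100 percentage scale are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent run on one task (hypothetical record shape)."""
    task_id: str
    hidden_tests_passed: int
    hidden_tests_total: int
    steps_used: int


def is_resolved(result: TaskResult, step_budget: int) -> bool:
    """A task counts only if every hidden test passes within the budget."""
    within_budget = result.steps_used <= step_budget
    all_passed = (
        result.hidden_tests_total > 0
        and result.hidden_tests_passed == result.hidden_tests_total
    )
    return within_budget and all_passed


def resolve_rate(results: list[TaskResult], step_budget: int) -> float:
    """Percentage of tasks resolved (assumed 0-100 scale)."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r, step_budget) for r in results)
    return 100.0 * resolved / len(results)


# Made-up runs, for illustration only.
runs = [
    TaskResult("task-a", 12, 12, steps_used=40),   # resolved
    TaskResult("task-b", 11, 12, steps_used=35),   # one hidden test fails
    TaskResult("task-c", 8, 8, steps_used=120),    # passes, but over budget
    TaskResult("task-d", 5, 5, steps_used=50),     # resolved
]
print(resolve_rate(runs, step_budget=100))  # 50.0
```

Note that partial credit does not exist under this rule: a run that passes 11 of 12 hidden tests scores the same as one that passes none.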

Dataset size: A set of high-difficulty software engineering tasks across multiple languages and frameworks, each requiring substantial code changes.
Mean score: 35.6
Median score: 45.0
Open / Closed: 15 / 4

Top Scorers

#	Model	Lab	Source	Score
01	Gemini 3 Flash	Google	Closed	71.2
02	Kimi K2.6	Moonshot AI	Open	58.6
03	GLM-5.1	Z.ai	Open	58.4
04	GPT-5.4	OpenAI	Closed	57.7
05	DeepSeek-V4-Pro	DeepSeek	Open	55.4
06	minimax-m2.5	MiniMax	Open	55.4
07	Qwen3.6-27B	Alibaba	Open	53.5
08	Kimi K2.5	Moonshot AI	Open	50.7
09	Qwen3.6 35B-A3B	Alibaba	Open	49.5
10	Claude Opus 4.6	Anthropic	Closed	45.0
11	Kimi K2 Instruct	Moonshot AI	Open	27.7
12	Qwen3-235B-A22B	Alibaba	Open	21.4
13	DeepSeek-V3.2	DeepSeek	Open	15.6
14	Claude Haiku 4.5	Anthropic	Closed	14.0
15	Gemma 3 27B IT	Google	Open	11.4

Score Distribution

Open vs Closed Source

Gap on SWE-Pro: +12.6 pts, closed leads

Top Open-Source Models

1	Kimi K2.6	58.6
2	GLM-5.1	58.4
3	DeepSeek-V4-Pro	55.4

Top Closed-Source Models

1	Gemini 3 Flash	71.2
2	GPT-5.4	57.7
3	Claude Opus 4.6	45.0
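The 12.6-point gap equals the difference between the best closed score and the best open score listed here. That reading is inferred from the numbers, not from a documented definition, so treat the sketch below as one plausible reconstruction. The function name `leader_gap` is hypothetical.

```python
# Scores copied from the two top-three lists on this page.
top_open = {"Kimi K2.6": 58.6, "GLM-5.1": 58.4, "DeepSeek-V4-Pro": 55.4}
top_closed = {"Gemini 3 Flash": 71.2, "GPT-5.4": 57.7, "Claude Opus 4.6": 45.0}


def leader_gap(closed: dict[str, float], open_: dict[str, float]) -> float:
    """Best closed score minus best open score, rounded to one decimal."""
    return round(max(closed.values()) - max(open_.values()), 1)


gap = leader_gap(top_closed, top_open)
print(f"{gap:+.1f} pts")  # +12.6 pts
```

A positive value means the closed-source leader is ahead. The gap says nothing about the rest of the field: below first place, open models hold ranks 2 and 3.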

Score vs Parameter Count

4 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average score	n
Moonshot AI	45.7	3
Alibaba	41.5	3
Google	41.3	2
DeepSeek	35.5	2
Z.ai	34.0	2
Anthropic	29.5	2
Meta	7.2	3
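Most of these lab averages can be reproduced from the Top Scorers table by grouping rows by lab and taking the mean. The sketch below does that for the 15 listed models. The Z.ai and Meta figures cannot be reproduced this way, because their `n` counts include models that do not appear in the table. The function name `average_by_lab` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# (model, lab, score) rows copied from the Top Scorers table.
ROWS = [
    ("Gemini 3 Flash", "Google", 71.2),
    ("Kimi K2.6", "Moonshot AI", 58.6),
    ("GLM-5.1", "Z.ai", 58.4),
    ("GPT-5.4", "OpenAI", 57.7),
    ("DeepSeek-V4-Pro", "DeepSeek", 55.4),
    ("minimax-m2.5", "MiniMax", 55.4),
    ("Qwen3.6-27B", "Alibaba", 53.5),
    ("Kimi K2.5", "Moonshot AI", 50.7),
    ("Qwen3.6 35B-A3B", "Alibaba", 49.5),
    ("Claude Opus 4.6", "Anthropic", 45.0),
    ("Kimi K2 Instruct", "Moonshot AI", 27.7),
    ("Qwen3-235B-A22B", "Alibaba", 21.4),
    ("DeepSeek-V3.2", "DeepSeek", 15.6),
    ("Claude Haiku 4.5", "Anthropic", 14.0),
    ("Gemma 3 27B IT", "Google", 11.4),
]


def average_by_lab(rows):
    """Map each lab to (mean score rounded to one decimal, model count)."""
    by_lab = defaultdict(list)
    for _model, lab, score in rows:
        by_lab[lab].append(score)
    return {lab: (round(mean(s), 1), len(s)) for lab, s in by_lab.items()}


averages = average_by_lab(ROWS)
for lab in ("Moonshot AI", "Alibaba", "Google", "DeepSeek", "Anthropic"):
    avg, n = averages[lab]
    print(f"{lab}: {avg} (n = {n})")
```

With only two or three models per lab, a single weak entry moves the average a long way, so these figures are better read as a summary of which models each lab submitted than as a ranking of the labs.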

Most Correlated Benchmarks

Benchmark	Pearson r	n
GPQA	+0.87	13
Terminal Bench	+0.82	12
SWE-Verified	+0.74	11
Arena Score	+0.71	16
MMLU-PRO	+0.69	8
AIME 2026	+0.63	9
HLE	+0.29	12
Pearson r: −1 to +1. Positive means the two benchmarks rank models in similar order; negative means the opposite.
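For readers who want to check a correlation themselves, here is a minimal sketch of Pearson r computed over the models that two benchmarks have in common (the `n` in each row). The SWE-Pro scores are taken from the table above. The scores for the second benchmark are invented placeholders, and `pearson_r` is a hypothetical helper, not this site's implementation.

```python
from math import sqrt


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# SWE-Pro scores for four models from the Top Scorers table.
swe_pro = [71.2, 58.6, 45.0, 14.0]
# Placeholder scores for the same four models on another benchmark (made up).
other_benchmark = [80.0, 70.0, 60.0, 30.0]

r = pearson_r(swe_pro, other_benchmark)
print(f"r = {r:+.2f}, n = {len(swe_pro)}")
```

Pearson r measures linear association between scores, so it is sensitive to outliers. With n between 8 and 16, as in the table, a single model can shift r noticeably.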

What It Captures Well

Closest public proxy for actual production engineering work.
Multi-language and multi-framework, so it reflects the real distribution of stacks.
Hidden tests make it harder to game without genuine engineering ability.

Where It Falls Short

Brand new in 2025, so the leaderboard has fewer entries and methodology is still settling.
Even more scaffold-dependent than SWE-Verified.
Expensive to run end-to-end, which limits how many times labs can re-evaluate.

Frequently Asked Questions

Why is SWE-Pro so much harder than SWE-Verified?

The tasks are longer and the codebases are bigger. A SWE-Verified issue typically needs a 5–50 line patch. SWE-Pro tasks routinely need hundreds of lines across several files, with careful test wiring.

Is SWE-Pro the right benchmark for evaluating coding agents?

For long-horizon, multi-file autonomy, yes. For day-to-day code completion, no; pair it with a fast-feedback benchmark such as HumanEval+ or an evaluation in your own repo.

Related Benchmarks

Based on score correlations across our database.

GPQA: Pearson r +0.87

