SWE-Verified: SWE-bench Verified

Name: SWE-Verified: SWE-bench Verified
Creator: OpenAI and Princeton
Published: 2024

Five hundred real GitHub issues, hand-checked by engineers, that test whether a model can ship a working code change.

Models Tested

Top Score

80.8

How It Works

SWE-Verified is the closest thing the field has to a real-world software engineering test. Each task gives the model a repository, an open issue, and the project test suite. The model has to produce a code patch that resolves the issue and passes the hidden tests. Humans verified that each issue is well-specified and solvable, so a failure points at the model, not at a broken benchmark.

A model agent is given the repo, the issue, and a sandboxed shell. It can read files, run commands, and edit code. The patch is scored pass or fail against the project test suite, including hidden regression tests written by the original maintainers. The score is the percentage of issues resolved, so a score of 80.8 corresponds to 404 of the 500 issues.
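The pass-or-fail scoring described above can be sketched in a few lines. This is a minimal illustration, not the official harness: the `TaskResult` class and its field names are hypothetical, and the split into tests that must flip from failing to passing and tests that must keep passing is an assumption modeled on the public SWE-bench dataset convention.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of running the hidden tests on one patched repo (hypothetical type)."""
    instance_id: str
    fail_to_pass: dict[str, bool]  # tests that failed before the fix; True = now passes
    pass_to_pass: dict[str, bool]  # tests that passed before the fix; True = still passes


def is_resolved(result: TaskResult) -> bool:
    """An issue counts as resolved only if every hidden test passes."""
    if not result.fail_to_pass:
        # Assumption: a task with no issue-specific tests cannot be credited.
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of issues resolved, the headline number on a leaderboard."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Illustrative, made-up outcomes for four tasks.
example_results = [
    TaskResult("task-1", {"test_fix": True}, {"test_old": True}),
    TaskResult("task-2", {"test_fix": True}, {"test_old": True}),
    TaskResult("task-3", {"test_fix": True}, {"test_old": False}),  # regression
    TaskResult("task-4", {"test_fix": True}, {}),
]
example_rate = resolved_rate(example_results)
```

A patch that fixes the issue but breaks an existing test, as in `task-3`, scores zero for that issue. There is no partial credit in this sketch.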

Dataset size

500 human-validated issues drawn from 12 popular Python repositories.

Mean score

73.3

Median score

73.8

Open / Closed

17 / 8

Top Scorers

#	Model	Lab	Source	Score
01	Claude Opus 4.6	Anthropic	Closed	80.8
02	DeepSeek-V4-Pro	DeepSeek	Open	80.6
03	Kimi K2.6	Moonshot AI	Open	80.2
04	GPT-5.2	OpenAI	Closed	80.0
05	Claude Sonnet 4.6	Anthropic	Closed	79.6
06	DeepSeek-V4-Flash	DeepSeek	Open	79.0
07	Qwen3.6-27B

Score Distribution

Open vs Closed Source

Gap on SWE-Verified: +0.2 pts (closed leads)
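The page does not say how the gap is defined. One reading that reproduces the 0.2-point figure is the best closed-source score minus the best open-source score. The sketch below assumes that definition and uses only the six complete rows from the Top Scorers table; the function name is hypothetical.

```python
# (model, source, score) rows copied from the Top Scorers table.
top_scorers = [
    ("Claude Opus 4.6", "Closed", 80.8),
    ("DeepSeek-V4-Pro", "Open", 80.6),
    ("Kimi K2.6", "Open", 80.2),
    ("GPT-5.2", "Closed", 80.0),
    ("Claude Sonnet 4.6", "Closed", 79.6),
    ("DeepSeek-V4-Flash", "Open", 79.0),
]


def best_by_source(rows):
    """Return {source: (model, score)} holding the top-scoring model per source type."""
    best = {}
    for model, source, score in rows:
        if source not in best or score > best[source][1]:
            best[source] = (model, score)
    return best


best = best_by_source(top_scorers)
# Assumed definition: best closed score minus best open score, rounded to 0.1.
gap = round(best["Closed"][1] - best["Open"][1], 1)
```

Under this assumption `gap` is positive when closed models lead. A gap computed from group means instead of group maxima could look quite different.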

Top Open-Source Models

01	DeepSeek-V4-Pro	80.6
02	Kimi K2.6	80.2
03	DeepSeek-V4-Flash	79.0

Top Closed-Source Models

01	Claude Opus 4.6	80.8
02	GPT-5.2	80.0
03	Claude Sonnet 4.6	79.6

Score vs Parameter Count

8 model(s) with undisclosed parameter counts not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average score	Models
OpenAI	77.5	n = 2
DeepSeek	76.5	n = 3
Anthropic	76.4	n = 4
Google	74.3	n = 2
Moonshot AI	74.1	n = 3

Most Correlated Benchmarks

Benchmark	Correlation	Models
GPQA	+0.82	n = 25
Terminal Bench	+0.76	n = 21
SWE-Pro	+0.74	n = 11
Arena Score	+0.73	n = 21
HLE

What It Captures Well

Tests end-to-end software engineering, not just code completion.
Real repos with real tests — failure modes look like real engineering bugs.
Resistant to memorization because the underlying repos keep evolving.

Where It Falls Short

Python-only, so it under-represents JavaScript, Go, Rust, and systems engineering.
Requires a capable agent harness — the same model scores very differently in different scaffolds.
Sensitive to compute budget; many runs use 100+ tool calls per issue.

Frequently Asked Questions

What is a good SWE-Verified score?

On this leaderboard, the top models with strong agent scaffolds, closed and open-weight alike, score around 80% in 2026, and the mean across tested models is 73.3%. Models below 20% are not yet useful as autonomous coding agents on this kind of task.

Is the score the model or the agent?

Both. SWE-Verified is famously sensitive to scaffolding — the same base model can swing 20 points based on the agent loop, tools, and retry strategy. Compare scores within the same scaffold whenever possible.
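One way to act on that advice is to group results by scaffold before ranking, and to track how far each model moves across scaffolds. The sketch below is illustrative only: the model names, scaffold names, and scores are made-up placeholders, not measured results, and both function names are hypothetical.

```python
from collections import defaultdict

# (model, scaffold, score) tuples. All values are invented placeholders.
runs = [
    ("model-a", "scaffold-x", 62.0),
    ("model-a", "scaffold-y", 45.0),
    ("model-b", "scaffold-x", 58.0),
    ("model-b", "scaffold-y", 50.0),
]


def rank_within_scaffold(runs):
    """Rank models separately inside each scaffold, best score first."""
    by_scaffold = defaultdict(list)
    for model, scaffold, score in runs:
        by_scaffold[scaffold].append((model, score))
    return {
        scaffold: sorted(entries, key=lambda e: e[1], reverse=True)
        for scaffold, entries in by_scaffold.items()
    }


def scaffold_spread(runs):
    """Per model, the gap between its best and worst score across scaffolds."""
    by_model = defaultdict(list)
    for model, _scaffold, score in runs:
        by_model[model].append(score)
    return {model: max(scores) - min(scores) for model, scores in by_model.items()}


rankings = rank_within_scaffold(runs)
spread = scaffold_spread(runs)
```

In this toy data the ranking flips between scaffolds, which is the situation where comparing a single headline number per model would mislead.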

How does SWE-Verified differ from SWE-Pro?

SWE-Verified consists of human-curated issues from open-source Python projects. SWE-Pro consists of longer, harder, enterprise-style tasks that demand more planning. For the same model, SWE-Pro scores are generally much lower than SWE-Verified scores.

Related Benchmarks

Based on score correlations across our database.

GPQA	Pearson r = +0.82	n = 25
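For readers unfamiliar with the statistic, Pearson r measures how closely two sets of scores follow a straight-line relationship across the models that have results on both benchmarks. The sketch below is a plain textbook implementation; how this site pairs models or handles missing scores is not stated, so the pairing of inputs is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores.

    xs[i] and ys[i] are assumed to be the same model's scores on two benchmarks.
    """
    if len(xs) != len(ys):
        raise ValueError("score lists must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two models")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)
```

A value near +1 means models that score high on one benchmark tend to score high on the other. With small n, as in some rows above, the estimate is noisy.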
