Agent Benchmark · 2024

SWE-bench Verified

Name: SWE-bench Verified
Creator: Princeton and OpenAI
Published: 2024
Keywords: SWE-bench Verified, AI agent benchmark, Princeton and OpenAI

Real GitHub issues that an agent must fix by editing the codebase so that the project's tests pass.

How It Works

SWE-bench Verified gives an agent a real bug report from a popular open-source Python project, along with the repository it lives in. The agent has to find the right files, write a patch, and make the hidden tests pass. Among public benchmarks, it comes closest to asking whether an agent can handle a junior engineer's ticket end to end.

Each task ships with the repository state, the issue text, and a hidden set of tests. A run counts as resolved only if the agent's patch makes the failing tests pass without breaking the passing ones. Scores are the percentage of the 500 tasks resolved.
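The resolution rule above can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the TaskResult structure and the is_resolved and resolve_rate helpers are hypothetical names introduced here, and the sketch assumes that tasks with no submitted result count as unresolved against the full 500.

```python
from dataclasses import dataclass, field


@dataclass
class TaskResult:
    """Test outcomes for one task after the agent's patch is applied.

    Hypothetical structure for illustration; not the harness's own format.
    """
    task_id: str
    # Tests that failed before the fix and should pass after it (name -> passed?).
    fail_to_pass: dict[str, bool] = field(default_factory=dict)
    # Tests that passed before the fix and must still pass (name -> passed?).
    pass_to_pass: dict[str, bool] = field(default_factory=dict)


def is_resolved(result: TaskResult) -> bool:
    """Resolved only if every target test now passes and nothing regressed."""
    # Assumption: a task with no recorded target tests is treated as unresolved.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult], total_tasks: int = 500) -> float:
    """Percentage of tasks resolved, measured against the full task count.

    Assumption: tasks missing from `results` count as unresolved.
    """
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / total_tasks


if __name__ == "__main__":
    runs = [
        # Fixes the bug and keeps existing behaviour: resolved.
        TaskResult("task-a", {"test_bug": True}, {"test_old": True}),
        # Fixes the bug but breaks a previously passing test: not resolved.
        TaskResult("task-b", {"test_bug": True}, {"test_old": False}),
        # Does not fix the bug: not resolved.
        TaskResult("task-c", {"test_bug": False}, {"test_old": True}),
    ]
    print(f"{resolve_rate(runs, total_tasks=len(runs)):.1f}% resolved")
```

Because a single regression in a previously passing test marks the whole task as unresolved, the score has no partial credit: each task contributes either one full point or nothing.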

Dataset size: 500 human-validated real-world bug-fix tasks.
Agent type: Coding
Published by: Princeton and OpenAI
Year: 2024

Top Agent Systems

No scored systems for this benchmark yet. Check back after the next weekly sync.

Strengths

Pass/fail on real tests, so there is no partial credit to argue about.
The Verified subset was human-checked, removing broken or impossible tasks.
Directly tied to work software teams actually pay for.

Limitations

Python only, so it does not speak to other language ecosystems.
Heavy on a handful of large libraries, which can favour familiar repos.
Expensive to run, so some labs report on smaller slices.

Frequently Asked Questions

How is this different from the SWE-bench leaderboard scores I see for models?

The same tasks are used, but the score depends heavily on the agent scaffold around the model: how it searches the repo, runs tests, and retries. Two systems on the same model can land 20 points apart, which is exactly what this leaderboard surfaces.

What is a good SWE-bench Verified score?

Top agent systems in 2026 resolve roughly 65–75% of tasks. A year earlier the best were near 50%, so this number moves fast.

Other Agent Benchmarks

Browse the other benchmarks on the leaderboard.

SWE-bench Verified Mini (Coding)
A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

SciCode (Coding)
Research-grade scientific coding problems drawn from real physics, biology, and chemistry work.

USACO (Coding)
Competitive programming problems that demand real algorithmic reasoning, not just boilerplate code.

GAIA (Research)
Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.
