Competitive programming problems that demand real algorithmic reasoning, not just boilerplate code.
The USACO benchmark draws its problems from the USA Computing Olympiad, a real programming olympiad for high-school students. Each problem needs a correct algorithm and an implementation efficient enough to run inside tight time and memory limits. The benchmark measures hard algorithmic problem-solving, where a slow or almost-right solution still fails.
Generated code is run against the official test cases, including large stress inputs. A problem counts as solved only if the program passes every case within the limits. Scores report the percentage of problems fully solved.
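The all-or-nothing scoring rule can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not the benchmark's actual harness: `CaseResult`, `problem_solved`, `score`, and the problem names are hypothetical, and the sample results are made up.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one test case (hypothetical structure for illustration)."""
    passed: bool         # output matched the expected answer
    within_limits: bool  # finished inside the time and memory limits


def problem_solved(cases):
    """A problem counts only if every case passes within the limits."""
    return len(cases) > 0 and all(c.passed and c.within_limits for c in cases)


def score(results_by_problem):
    """Percentage of problems fully solved; no partial credit."""
    if not results_by_problem:
        return 0.0
    solved = sum(problem_solved(cases) for cases in results_by_problem.values())
    return 100.0 * solved / len(results_by_problem)


# Made-up results for four hypothetical problems.
example = {
    "problem_a": [CaseResult(True, True), CaseResult(True, True)],
    "problem_b": [CaseResult(True, True), CaseResult(True, False)],   # too slow on one case
    "problem_c": [CaseResult(True, True), CaseResult(False, True)],   # wrong on one case
    "problem_d": [CaseResult(True, True)],
}
print(score(example))  # 50.0
```

In this sketch, `problem_b` and `problem_c` each pass half of their cases yet contribute nothing to the score, which is why an almost-right solution is no better than a failed one.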
| # | Agent System | Model | Score (% solved) |
|---|---|---|---|
| 01 | USACO Episodic + Semantic | GPT-5 Medium (August 2025) | 69.1 |
| 02 | USACO Episodic + Semantic | o4-mini High (April 2025) | 64.8 |
| 03 | USACO Episodic + Semantic | o4-mini Low (April 2025) | 53.1 |
| 04 | USACO Episodic + Semantic | Claude Opus 4.1 High (August 2025) | 51.5 |
| 05 | USACO Episodic + Semantic | Claude Opus 4.1 (August 2025) | 48.2 |
| 06 | USACO Episodic + Semantic | o3 Medium (April 2025) | 46.3 |
| 07 | USACO Episodic + Semantic | GPT-4.1 (April 2025) | 45.0 |
| 08 | USACO Episodic + Semantic | DeepSeek V3 (March 2025) | 39.1 |
| 09 | USACO Episodic + Semantic | DeepSeek R1 (January 2025) | 38.1 |
| 10 | USACO Episodic + Semantic | Claude-3.7 Sonnet (February 2025) | 29.3 |
| 11 | USACO Episodic + Semantic | Gemini 2.0 Flash (February 2025) | 27.0 |
| 12 | USACO Episodic + Semantic | Claude-3.7 Sonnet High (February 2025) | 26.7 |
| 13 | HAL Generalist Agent | GPT-4.1 (April 2025) | 25.4 |
Most coding benchmarks accept any working solution. USACO also requires the solution to be efficient enough to pass large inputs, which is a separate and much harder skill.
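To make the gap between "working" and "efficient" concrete, here is a minimal sketch using a made-up task (counting pairs of values that sum to a target). The task and both function names are hypothetical and are not taken from the USACO problem set. Both functions return the same answer, but only the second scales to large inputs.

```python
from collections import Counter


def count_pairs_bruteforce(values, target):
    """Correct but O(n^2): checks every pair (i, j) with i < j."""
    n = len(values)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if values[i] + values[j] == target
    )


def count_pairs_fast(values, target):
    """Same answer in O(n): counts earlier values that complete each pair."""
    seen = Counter()
    total = 0
    for v in values:
        total += seen[target - v]  # earlier elements that pair with v
        seen[v] += 1
    return total


values = [1, 5, 3, 3, 7, -1]
print(count_pairs_bruteforce(values, 6))  # 3
print(count_pairs_fast(values, 6))        # 3
```

On an input of 100,000 values the brute-force version performs roughly 5 billion pair checks, while the linear version performs about 100,000 dictionary lookups. Under a typical time limit, only the latter would pass a large stress case, even though both are correct.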