Agent Benchmark · 2024

USACO: USA Computing Olympiad

Name: USACO: USA Computing Olympiad
Creator: Princeton
Published: 2024
Keywords: USACO, AI agent benchmark, Princeton

Competitive programming problems that demand real algorithmic reasoning, not just boilerplate code.

Top Score: 69.1
Published: 2024
Source: Princeton

How It Works

USACO uses problems from a real high-school programming olympiad. Each one needs a correct algorithm and an efficient implementation that runs inside tight time and memory limits. It measures hard algorithmic problem-solving, where a slow or almost-right solution still fails.

Generated code is run against the official test cases, including large stress inputs. A problem counts as solved only if the program passes every case within the limits. Scores report the percentage of problems fully solved.
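The all-or-nothing rule described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `CaseResult` structure, the function names, and the 2.0 second and 256 MB limits are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Outcome of one test case (hypothetical structure, not the benchmark's own)."""
    passed: bool       # program output matched the expected answer
    seconds: float     # time the program took on this case
    memory_mb: float   # peak memory the program used on this case


def problem_solved(results, time_limit_s, memory_limit_mb):
    """A problem counts only if every case is correct and within both limits."""
    if not results:
        return False
    return all(
        r.passed and r.seconds <= time_limit_s and r.memory_mb <= memory_limit_mb
        for r in results
    )


def score(solved_flags):
    """Percentage of problems fully solved, rounded to one decimal place."""
    if not solved_flags:
        return 0.0
    return round(100.0 * sum(solved_flags) / len(solved_flags), 1)


# Illustrative limits only; real limits vary by problem.
TIME_LIMIT_S = 2.0
MEMORY_LIMIT_MB = 256.0

fast = [CaseResult(True, 0.4, 64.0), CaseResult(True, 1.1, 80.0)]
slow = [CaseResult(True, 0.4, 64.0), CaseResult(True, 3.5, 80.0)]  # correct, but too slow

flags = [problem_solved(r, TIME_LIMIT_S, MEMORY_LIMIT_MB) for r in (fast, slow)]
print(flags, score(flags))
```

In this sketch the second submission produces correct output on every case, yet it earns no credit because one case exceeds the time limit.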

Dataset size: 307 problems from the USA Computing Olympiad.
Agent type: Coding
Published by: Princeton
Year: 2024

Top Agent Systems

#	Agent System	Model	Score
01	USACO Episodic + Semantic	GPT-5 Medium (August 2025)	69.1
02	USACO Episodic + Semantic	o4-mini High (April 2025)	64.8
03	USACO Episodic + Semantic	o4-mini Low (April 2025)	53.1
04	USACO Episodic + Semantic	Claude Opus 4.1 High (August 2025)	51.5
05	USACO Episodic + Semantic	Claude Opus 4.1 (August 2025)	48.2
06	USACO Episodic + Semantic	o3 Medium (April 2025)	46.3
07	USACO Episodic + Semantic	GPT-4.1 (April 2025)	45.0
08	USACO Episodic + Semantic	DeepSeek V3 (March 2025)	39.1
09	USACO Episodic + Semantic	DeepSeek R1 (January 2025)	38.1
10	USACO Episodic + Semantic	Claude-3.7 Sonnet (February 2025)	29.3
11	USACO Episodic + Semantic	Gemini 2.0 Flash (February 2025)	27.0
12	USACO Episodic + Semantic	Claude-3.7 Sonnet High (February 2025)	26.7
13	HAL Generalist Agent	GPT-4.1 (April 2025)	25.4

Strengths

Tests genuine algorithm design, not pattern recall.
Strict time limits punish brute-force shortcuts.
Objective, fully automated grading.

Limitations

Olympiad style is far from everyday software work.
All-or-nothing scoring per problem makes scores volatile: a single failed test case forfeits the whole problem.
Strong contest solutions may appear in training data.
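To get a rough sense of how much all-or-nothing scoring can move a result, the sketch below computes what one problem is worth and a binomial standard error for a given score. It assumes every problem is an independent pass/fail trial, which is a simplification for illustration and not something the benchmark itself claims; the function names are hypothetical.

```python
import math

N_PROBLEMS = 307  # dataset size stated on this page


def points_per_problem(n=N_PROBLEMS):
    """Score change, in percentage points, from solving one more problem."""
    return 100.0 / n


def standard_error(score_pct, n=N_PROBLEMS):
    """Binomial standard error of a score, in percentage points.

    Assumes each problem is an independent pass/fail trial
    (an illustrative simplification).
    """
    p = score_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


print(round(points_per_problem(), 2))    # about a third of a point per problem
print(round(standard_error(69.1), 1))    # spread around the top score
```

With 307 problems, each one is worth roughly 0.33 points, and under this simple model the top score of 69.1 carries a standard error of about 2.6 points. Gaps of a point or two between neighboring rows may therefore not be meaningful.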

Frequently Asked Questions

Why is USACO so much harder than typical coding benchmarks?

Most coding benchmarks accept any solution that produces the correct output. USACO also requires the solution to be efficient enough to pass large inputs within the time and memory limits, which is a separate and much harder skill.
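The gap between a working solution and an efficient one can be shown on a small made-up task: count the pairs of values whose sum is at most a given limit. The task and both functions are hypothetical and not taken from the benchmark. Both functions return the same answer, but they scale very differently.

```python
def count_pairs_brute(values, limit):
    """O(n^2): check every pair of positions i < j."""
    n = len(values)
    return sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if values[i] + values[j] <= limit
    )


def count_pairs_fast(values, limit):
    """O(n log n): sort once, then sweep two pointers inward."""
    ordered = sorted(values)
    lo, hi = 0, len(ordered) - 1
    total = 0
    while lo < hi:
        if ordered[lo] + ordered[hi] <= limit:
            # ordered[lo] forms a valid pair with every element in (lo, hi].
            total += hi - lo
            lo += 1
        else:
            hi -= 1
    return total


sample = [1, 2, 3, 4]
print(count_pairs_brute(sample, 5), count_pairs_fast(sample, 5))
```

For a list of 100,000 values, the brute-force version performs on the order of 5 billion pair checks, while the sorted two-pointer version needs one sort and a single linear pass. On a benchmark graded with large stress inputs, only the second kind of solution is likely to finish in time.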

Other Agent Benchmarks

Browse the other benchmarks on the leaderboard.

Coding

SWE-bench Verified

Real GitHub issues an agent has to fix by editing a codebase until the project test suite passes.

Coding

SWE-bench Verified Mini

A smaller, cheaper slice of SWE-bench Verified used to compare coding agents without a huge compute bill.

Coding

SciCode

Research-grade scientific coding problems drawn from real physics, biology, and chemistry work.

Research

GAIA

Real-world assistant questions that need web browsing, tool use, and multi-step reasoning to answer correctly.

