Benchmarks · 2024

LiveCodeBench: Competition Programming Benchmark

Name: LiveCodeBench: Competition Programming Benchmark
Creator: LiveCodeBench Team (UC Berkeley, MIT, Cornell)
Published: 2024
Keywords: LiveCodeBench, AI benchmark, text model evaluation, LiveCodeBench Team (UC Berkeley, MIT, Cornell)

Contamination-resistant coding benchmark drawn from competition problems posted after each model’s training cutoff.


Top Score: 91.7

How It Works

LiveCodeBench tests competition-style coding while guarding against training-data contamination. Problems are sourced from public competitive programming sites and stamped with their release dates, so a model is scored only on problems posted after its training cutoff. The result is a cleaner test of how well a model writes algorithmic code from scratch rather than recalling it.
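The date filter described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the `Problem` record, its field names, the sample problems, and the cutoff date are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the field names are illustrative, not the benchmark's schema.
    problem_id: str
    source: str
    release_date: date


def eligible_problems(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


# Made-up problems, one per source site named in this profile.
problems = [
    Problem("a1", "AtCoder", date(2024, 1, 15)),
    Problem("l7", "LeetCode", date(2024, 6, 2)),
    Problem("c3", "Codeforces", date(2024, 9, 20)),
]

cutoff = date(2024, 4, 30)  # assumed training cutoff for an imaginary model
fresh = eligible_problems(problems, cutoff)
print([p.problem_id for p in fresh])  # ['l7', 'c3']
```

The strict comparison is a deliberate choice in this sketch: a problem released on the cutoff date itself is treated as possibly seen and is excluded.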

Each problem comes with hidden unit tests. The model writes a solution, the harness runs it against those tests, and the score is the pass rate. Leaderboards bucket scores by time window so that a model does not get credit for problems it could have memorized.
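As a rough sketch of that scoring loop, the example below grades solutions against test cases and reports a pass rate per time window. It rests on several assumptions that are not taken from the benchmark itself: solutions are plain Python callables (a real harness would run untrusted code in a sandbox), a problem counts as solved only if every test passes, and the window labels and helper names `passes_all` and `pass_rate_by_window` are hypothetical.

```python
from collections import defaultdict


def passes_all(solution, tests):
    """Return True if the solution gives the expected output for every test case."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            return False  # a crash counts as a failed test
    return True


def pass_rate_by_window(results):
    """Map (window_label, passed) pairs to {window_label: percent of problems solved}."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for window, passed in results:
        total[window] += 1
        solved[window] += int(passed)
    return {window: 100.0 * solved[window] / total[window] for window in total}


# Toy "hidden tests" for a two-argument sum problem: (arguments, expected output).
tests_sum = [((1, 2), 3), ((0, 0), 0), ((-5, 5), 0)]


def good(a, b):
    return a + b


def buggy(a, b):
    return abs(a) + abs(b)  # wrong for negative inputs


results = [
    ("2024-Q3", passes_all(good, tests_sum)),
    ("2024-Q3", passes_all(buggy, tests_sum)),
    ("2024-Q4", passes_all(good, tests_sum)),
]
scores = pass_rate_by_window(results)
print(scores)  # {'2024-Q3': 50.0, '2024-Q4': 100.0}
```

Keeping the window label next to each result makes it straightforward to report only the windows that fall after a given model's cutoff.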

Dataset size: Hundreds of competition problems sourced from LeetCode, AtCoder, and Codeforces, refreshed continuously.
Mean score: 52.5
Median score: 55.0
Open / Closed: 32 / 54

Top Scorers

#	Model	Lab	Source	Score
01	Gemini 3 Pro	Google	Closed	91.7
02	Gemini 3 Flash (Thinking Minimal)	Google	Closed	90.8
03	GLM-4.7	Z.ai	Open	89.4
04	GPT-5.2	OpenAI	Closed	88.9
05	GPT-5.2 High	OpenAI	Closed	88.9
06	Claude Opus 4.5 (Thinking 32K)	Anthropic	Closed	87.1
07	GPT-5.1	OpenAI	Closed	86.8
08	GPT-5.1 High	OpenAI	Closed	86.8
09	OpenAI o4-mini	OpenAI	Closed	85.9
10	Kimi K2 Thinking	Moonshot AI	Open	85.3
11	GPT-5 High	OpenAI	Closed	84.6
12	GPT-5 Mini High	OpenAI	Closed	83.8
13	Grok 4 Fast Reasoning	xAI	Closed	83.2
14	Grok 4.1 Fast Reasoning	xAI	Closed	82.2
15	Grok 4 (0709)	xAI	Closed	81.9

Open vs Closed Source

Gap on LiveCodeBench: +2.3 pts, closed leads (top closed model 91.7 vs. top open model 89.4).

Top Open-Source Models

#	Model	Score
1	GLM-4.7	89.4
2	Kimi K2 Thinking	85.3
3	DeepSeek-R1	77

Top Closed-Source Models

#	Model	Score
1	Gemini 3 Pro	91.7
2	Gemini 3 Flash (Thinking Minimal)	90.8
3	GPT-5.2	88.9

Score vs Parameter Count

54 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average score	n
Z.ai	73.1	3
OpenAI	69.7	19
Moonshot AI	67.3	3
Anthropic	62.1	11
xAI	60.9	7
DeepSeek	57.5	4
Google	54.5	12
Alibaba	45.5	6
Amazon	41.0	2
Mistral	40.3	2

Most Correlated Benchmarks

Benchmark	Pearson r	n
GPQA	+0.95	86
AA Intelligence Index	+0.89	86
AA LCR	+0.88	75
SciCode	+0.87	85
IFBench	+0.84	75
MMLU-PRO	+0.82	85
MATH-500	+0.79	52
Terminal Bench Hard	+0.78	72
Arena Score	+0.77	85
HLE	+0.77	86
Pearson r: −1 to +1. Positive means the two benchmarks rank models in similar order; negative means the opposite.
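For readers who want to reproduce a coefficient like these, here is a minimal sketch of Pearson r over paired scores. The function name `pearson_r` and the sample numbers are illustrative only; this profile does not say how its own values were computed, for example how models missing from one benchmark are handled.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of paired scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for three models on two benchmarks (not real results).
bench_a = [1.0, 2.0, 3.0]
same_order = [2.0, 4.0, 6.0]
reverse_order = [6.0, 4.0, 2.0]

print(pearson_r(bench_a, same_order))     # 1.0
print(pearson_r(bench_a, reverse_order))  # -1.0
```

Each coefficient is computed only over models that have a score on both benchmarks, which is why n varies from row to row in the table.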

What It Captures Well

Contamination resistance baked into the methodology.
Real competition problems with rigorous test coverage.
Updated continuously, so scores stay relevant.

Where It Falls Short

Algorithmic puzzles, not production engineering: high scores do not guarantee real-world coding ability.
Sensitive to prompt format and language choice.
Hidden tests can be brittle on edge cases.

Frequently Asked Questions

How does LiveCodeBench differ from HumanEval?

HumanEval is small (164 problems), dates from 2021, and is saturated. LiveCodeBench refreshes constantly with new competition problems and stamps each one with its release date, so contamination is auditable.

Related Benchmarks

Based on score correlations across our database.

GPQA (Pearson r +0.95)
AA Intelligence Index (Pearson r +0.89)
AA LCR (Pearson r +0.88)
SciCode (Pearson r +0.87)


