Contamination-resistant coding benchmark drawn from competition problems posted after each model’s training cutoff.
LiveCodeBench tests competition-style coding without contamination. Problems are sourced from public coding sites and stamped with their release date, so a model is only scored on problems posted after its training cutoff. The result is a clean test of how well a model writes algorithmic code from scratch.
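The date filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than LiveCodeBench's own code: the `Problem` record, its field names, the `eligible_problems` helper, and the example dates are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the field names are illustrative only.
    problem_id: str
    release_date: date


def eligible_problems(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.release_date > training_cutoff]


# Made-up example data, not real benchmark problems.
problems = [
    Problem("p1", date(2024, 3, 1)),
    Problem("p2", date(2024, 9, 15)),
    Problem("p3", date(2025, 1, 10)),
]
cutoff = date(2024, 6, 30)

fresh = eligible_problems(problems, cutoff)
print([p.problem_id for p in fresh])  # ['p2', 'p3']
```

A model with a later cutoff simply sees a smaller eligible set, which is why scores are only comparable within the same time window.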
Each problem has hidden unit tests. The model writes a solution, the harness runs it against the tests, and the score is the pass rate. Leaderboards bucket scores by time window so that a model never gets credit for problems it could have memorized.
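The scoring loop can be sketched as follows. It is a simplified stand-in for a real harness, which would run untrusted code in a sandbox with time limits. The helper names `passes_all` and `pass_rate_by_window`, the calendar-month windows, and the rule that a problem counts only if every hidden test passes are assumptions made for this sketch, and the submissions are made-up data.

```python
from collections import defaultdict
from datetime import date


def passes_all(solution, tests):
    """Return True if the solution gives the expected output on every test."""
    for args, expected in tests:
        try:
            if solution(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed test.
            return False
    return True


def pass_rate_by_window(results):
    """Group (release_date, passed) pairs by calendar month and
    return the percentage of problems passed in each month."""
    buckets = defaultdict(list)
    for release_date, passed in results:
        buckets[(release_date.year, release_date.month)].append(passed)
    return {window: 100.0 * sum(flags) / len(flags)
            for window, flags in sorted(buckets.items())}


# Made-up submissions: (release date, candidate solution, hidden tests).
submissions = [
    (date(2025, 1, 5), lambda a, b: a + b, [((1, 2), 3), ((0, 0), 0)]),
    (date(2025, 1, 20), lambda xs: sorted(xs), [(([3, 1, 2],), [1, 2, 3])]),
    (date(2025, 2, 3), lambda n: n * 2, [((2,), 4), ((3,), 7)]),  # fails the second test
    (date(2025, 2, 9), lambda s: s[::-1], [(("abc",), "cba")]),
]

results = [(d, passes_all(sol, tests)) for d, sol, tests in submissions]
scores = pass_rate_by_window(results)
print(scores)  # {(2025, 1): 100.0, (2025, 2): 50.0}
```

Reporting one number per window keeps the comparison honest: a model is ranked only on the windows that fall after its training cutoff.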
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Gemini 3 Pro | Google | Closed | 91.7 |
| 02 | GLM-4.7 | Z.ai | Open | 89.4 |
| 03 | GPT-5.2 | OpenAI | Closed | 88.9 |
| 04 | GPT-5.1 | OpenAI | Closed | 86.8 |
| 05 | Kimi K2 Thinking | Moonshot AI | Open | 85.3 |
| 06 | Grok 4 Fast Reasoning | xAI | Closed | 83.2 |
| 07 | Grok 4.1 Fast Reasoning | xAI | Closed | 82.2 |
| 08 | Gemini 2.5 Pro | Google | Closed | 80.1 |
| 09 | Gemini 3 Flash | Google | Closed | 79.7 |
| 10 | DeepSeek-R1 | DeepSeek | Open | 77.0 |
| 11 | Claude Opus 4.5 | Anthropic | Closed | 73.8 |
| 12 | Qwen3 Max Preview | Alibaba | Closed | 65.1 |
| 13 | Gemini 2.5 Flash Preview 09-2025 | Google | Closed | 62.5 |
| 14 | DeepSeek-V3.2 | DeepSeek | Open | 59.3 |
| 15 | DeepSeek-V3.1 | DeepSeek | Open | 57.7 |

HumanEval is small, old, and saturated. LiveCodeBench refreshes constantly with new competition problems and stamps each one with its release date, so contamination is auditable.