Tests whether a model can write research code across physics, mathematics, biology, and chemistry.
SciCode is a code-generation benchmark built from real research problems. Each problem is broken into sub-problems, and each sub-problem asks the model to write a function that simulates a physical system, solves a mathematical problem, or processes biological data. Unlike algorithmic puzzles, SciCode rewards domain knowledge combined with implementation ability.
Each problem ships with reference solutions and unit tests. The model writes the code; the harness runs it against the unit tests and scores the pass rate. Sub-problem scores are aggregated into per-problem and per-domain scores.
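The aggregation step can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the record layout, the `aggregate` helper, and the choice of an unweighted mean at each level are all assumptions, and the real scoring rule may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical result records: (domain, problem_id, subproblem_id, passed).
# This layout is an assumption for illustration, not SciCode's real format.
results = [
    ("physics", "p1", "p1.1", True),
    ("physics", "p1", "p1.2", False),
    ("physics", "p2", "p2.1", True),
    ("biology", "p3", "p3.1", True),
    ("biology", "p3", "p3.2", True),
]


def aggregate(results):
    """Return (per_problem, per_domain) pass rates as fractions in [0, 1]."""
    by_problem = defaultdict(list)
    domain_of = {}
    for domain, problem, _sub, passed in results:
        by_problem[problem].append(1.0 if passed else 0.0)
        domain_of[problem] = domain

    # Per-problem score: fraction of sub-problems whose unit tests passed.
    per_problem = {p: mean(v) for p, v in by_problem.items()}

    # Per-domain score: unweighted mean of the problem scores in that domain
    # (an assumed rule; a harness could also weight by sub-problem count).
    by_domain = defaultdict(list)
    for p, score in per_problem.items():
        by_domain[domain_of[p]].append(score)
    per_domain = {d: mean(v) for d, v in by_domain.items()}
    return per_problem, per_domain


per_problem, per_domain = aggregate(results)
```

With the sample records, problem `p1` passes one of its two sub-problems, so its score is 0.5, and the physics domain averages `p1` and `p2` to 0.75.
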
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | Gemini 3.1 Pro Preview | Google | Closed | 58.9 |
| 02 | GPT-5.4 | OpenAI | Closed | 56.6 |
| 03 | GPT-5.5 | OpenAI | Closed | 56.1 |
| 04 | Gemini 3 Pro | Google | Closed | 56.1 |
| 05 | Claude Opus 4.7 | Anthropic | Closed | 54.5 |
| 06 | Kimi K2.6 | Moonshot AI | Open | 53.5 |
| 07 | Gemini 3.5 Flash | Google | Closed | 53.1 |
| 08 | GPT-5.2 | OpenAI | Closed | 52.1 |
| 09 | MiMo-V2.5-Pro | Xiaomi | Closed | 50.2 |
| 10 | DeepSeek-V4-Pro | DeepSeek | Open | 50.0 |
| 11 | Gemini 3 Flash | Google | Closed | 49.9 |
| 12 | Kimi K2.5 | Moonshot AI | Open | 49.0 |
| 13 | Grok 4.3 | xAI | Closed | 47.3 |
| 14 | MiniMax M2.7 | MiniMax | Closed | 47.0 |
| 15 | Claude Opus 4.5 | Anthropic | Closed | 47.0 |