SciCode: Scientific Code Generation Benchmark

Creator: Carnegie Mellon, Princeton, and collaborators
Published: 2024

Tests whether a model can write research code across physics, mathematics, biology, and chemistry.

Models Tested: 144
Top Score: 60.2

How It Works

SciCode is a code-generation benchmark built from real research problems. Each problem is broken into sub-problems, and each sub-problem asks the model to write a function that, for example, simulates a physical system, solves a mathematical problem, or processes biological data. Unlike algorithmic puzzles, SciCode rewards domain knowledge as well as implementation ability.

Each problem ships with reference solutions and unit tests. The model writes code; the harness runs it against the tests and records the pass rate. Sub-problem scores are then aggregated into per-problem and per-domain scores.
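
The aggregation step can be sketched roughly as follows. This is a minimal illustration, not the SciCode harness itself: the function names, the input layout, and the use of unweighted means at every level are all assumptions, and the real harness may weight or threshold results differently.

```python
from collections import defaultdict


def pass_rate(test_results):
    """Fraction of unit tests that passed (0.0 if there are no tests)."""
    return sum(test_results) / len(test_results) if test_results else 0.0


def aggregate(sub_results):
    """Roll sub-problem pass rates up to per-problem and per-domain scores.

    sub_results: list of (domain, problem_id, [bool per unit test]).
    Assumption: plain unweighted means at each level.
    """
    rates_by_problem = defaultdict(list)
    domain_of = {}
    for domain, problem_id, tests in sub_results:
        rates_by_problem[problem_id].append(pass_rate(tests))
        domain_of[problem_id] = domain

    # Per-problem score: mean of its sub-problem pass rates.
    problem_scores = {
        pid: sum(rates) / len(rates) for pid, rates in rates_by_problem.items()
    }

    # Per-domain score: mean of the problem scores in that domain.
    scores_by_domain = defaultdict(list)
    for pid, score in problem_scores.items():
        scores_by_domain[domain_of[pid]].append(score)
    domain_scores = {
        d: sum(scores) / len(scores) for d, scores in scores_by_domain.items()
    }
    return problem_scores, domain_scores


# Hypothetical results, for illustration only.
example = [
    ("physics", "p1", [True, True]),
    ("physics", "p1", [True, False]),
    ("biology", "p2", [False, False]),
]
problem_scores, domain_scores = aggregate(example)
```

With the hypothetical input above, problem `p1` scores 0.75 (the mean of 1.0 and 0.5) and `p2` scores 0.0, so the physics and biology domain scores are 0.75 and 0.0.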

Dataset size: 338 sub-problems drawn from 80 real scientific research problems across physics, math, biology, and chemistry.
Mean score: 38.1
Median score: 39.8
Open / Closed: 61 / 83

Top Scorers

#	Model	Lab	Source	Score
01	Claude Fable 5	Anthropic	Closed	60.2
02	Gemini 3.1 Pro Preview	Google	Closed	58.9
03	GPT-5.4	OpenAI	Closed	56.6
04	GPT-5.4 High	OpenAI	Closed	56.6
05	GPT-5.6 Sol	OpenAI	Closed	56.1
06	GPT-5.5	OpenAI	Closed	56.1
07	Gemini 3 Pro	Google	Closed	56.1
08

Open vs Closed Source

Gap on SciCode: +6.7 pts (closed leads)

Top Open-Source Models

1	Kimi K2.6	53.5
2	GLM-5.2	50.5
3	DeepSeek-V4-Pro	50

Top Closed-Source Models

1	Claude Fable 5	60.2
2	Gemini 3.1 Pro Preview	58.9
3	GPT-5.4	56.6
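
The page does not say how the open/closed gap is computed. One reading that matches the listed figures is the difference between the best closed-source score and the best open-source score. The sketch below assumes that definition and reads each entry in the two lists above as a model name followed by its score; `best_gap` is a hypothetical helper name.

```python
# Scores copied from the two top-3 lists above.
top_open = {"Kimi K2.6": 53.5, "GLM-5.2": 50.5, "DeepSeek-V4-Pro": 50.0}
top_closed = {
    "Claude Fable 5": 60.2,
    "Gemini 3.1 Pro Preview": 58.9,
    "GPT-5.4": 56.6,
}


def best_gap(closed, open_):
    """Best closed score minus best open score, rounded to one decimal.

    Assumption: the gap compares the single top model on each side.
    """
    return round(max(closed.values()) - max(open_.values()), 1)


gap = best_gap(top_closed, top_open)  # 60.2 - 53.5
```

Under this assumption `gap` comes out to 6.7, which agrees with the figure shown on the page.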

Average Score by Lab

Anthropic	45.4	n = 17
Xiaomi	45.3	n = 3
MiniMax	45.0	n = 3
OpenAI	44.4	n = 27
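
A per-lab average like the ones above is a simple group-by mean. The sketch below applies it only to the seven complete rows of the Top Scorers table, so its output will not match the page's figures, which cover every tested model from each lab (for example n = 17 for Anthropic). The function name `lab_averages` is an assumption for illustration.

```python
from collections import defaultdict

# (model, lab, score) rows copied from the Top Scorers table.
rows = [
    ("Claude Fable 5", "Anthropic", 60.2),
    ("Gemini 3.1 Pro Preview", "Google", 58.9),
    ("GPT-5.4", "OpenAI", 56.6),
    ("GPT-5.4 High", "OpenAI", 56.6),
    ("GPT-5.6 Sol", "OpenAI", 56.1),
    ("GPT-5.5", "OpenAI", 56.1),
    ("Gemini 3 Pro", "Google", 56.1),
]


def lab_averages(rows):
    """Return {lab: (mean score, n)} for the given (model, lab, score) rows."""
    by_lab = defaultdict(list)
    for _model, lab, score in rows:
        by_lab[lab].append(score)
    return {lab: (sum(s) / len(s), len(s)) for lab, s in by_lab.items()}


for lab, (mean, n) in sorted(lab_averages(rows).items()):
    print(f"{lab}\t{mean:.1f}\tn = {n}")
```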

Most Correlated Benchmarks

GPQA	+0.91	n = 144
MMLU-PRO	+0.91	n = 95
AA Intelligence Index	+0.88	n = 144
LiveCodeBench	+0.87	n = 85
MATH-500

What It Captures Well

Tests applied scientific coding, not just generic algorithms.
Cross-domain breadth catches narrow specialization.
Strong predictor of how useful a model will be in research workflows.

Where It Falls Short

Requires domain knowledge that smaller models lack.
Test coverage varies across sub-problems.
Python-only.

Related Benchmarks

Based on score correlations across our database.

Pearson r +0.91
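
For reference, Pearson r between two benchmarks can be computed from paired model scores as sketched below. The page does not describe its exact procedure; this sketch assumes each pair comes from one model that has a score on both benchmarks, which would explain why n differs between benchmarks. The sample numbers are made up for illustration and are not taken from the leaderboard.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired scores: one entry per model scored on both benchmarks.
scicode_scores = [30.0, 40.0, 50.0, 60.0]
other_scores = [55.0, 62.0, 78.0, 85.0]
r = pearson_r(scicode_scores, other_scores)
```

A value near +1 means models that score higher on one benchmark tend to score higher on the other; it says nothing about why the two move together.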
