Benchmarks · 2024

AA Intelligence Index: Artificial Analysis Intelligence Index

Name: AA Intelligence Index: Artificial Analysis Intelligence Index
Creator: Artificial Analysis
Published: 2024
Keywords: AA Intelligence Index, AI benchmark, text model evaluation, Artificial Analysis

A composite score from Artificial Analysis that blends a dozen reasoning, coding, and math benchmarks into a single number.

Open Dataset
Models Tested: 146
Top Score: 59.9
Published: 2024
Source: Artificial Analysis

How It Works

The Intelligence Index is Artificial Analysis’s way of giving every LLM a single comparable score across reasoning, coding, math, and long-context ability. Artificial Analysis runs a dozen public benchmarks through a shared prompting harness with the same prompt and scoring settings, normalizes each per-benchmark score to a 0–100 range, and averages them with fixed, published weights. The composite is recomputed whenever a new model lands or a benchmark is added. It is most useful as a quick way to spot the top tier without studying every sub-benchmark.
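The normalize-then-average step described above can be sketched in a few lines of Python. This is a minimal illustration, not Artificial Analysis's actual code: the function names, the linear rescaling, the raw results (items correct out of 50), and the equal weights are all assumptions made for the example. Only the benchmark names come from this page.

```python
def normalize(raw, lo, hi):
    """Rescale a raw benchmark result from the range [lo, hi] onto 0-100."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return 100.0 * (raw - lo) / (hi - lo)


def composite_index(normalized, weights):
    """Weighted average of normalized scores; weights need not sum to 1."""
    missing = set(weights) - set(normalized)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(normalized[name] * w for name, w in weights.items())
    return weighted_sum / total_weight


# Hypothetical results for one model: items answered correctly out of 50.
raw_correct = {"MMLU-Pro": 40, "GPQA Diamond": 30, "LiveCodeBench": 25, "AA-LCR": 35}
normalized = {name: normalize(n, 0, 50) for name, n in raw_correct.items()}

# Equal weights are an assumption; the real index uses its own published weights.
weights = {name: 1.0 for name in raw_correct}

index = composite_index(normalized, weights)
print(index)  # 65.0
```

Because the weights are fixed, a change in one benchmark moves the composite by that benchmark's share of the total weight. A model that is strong on a heavily weighted benchmark can therefore outrank a more balanced one.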

Dataset size: Weighted aggregate over the Artificial Analysis evaluation suite, including MMLU-Pro, GPQA Diamond, HLE, MATH-500, LiveCodeBench, SciCode, IFBench, and AA-LCR.
Mean score: 25.5
Median score: 25.3
Open / Closed: 63 / 83

Top Scorers

#	Model	Lab	Source	Score
01	Claude Fable 5	Anthropic	Closed	59.9
02	GPT-5.5	OpenAI	Closed	54.8
03	Grok 4.5	xAI	Closed	53.8
04	Claude Opus 4.7 Thinking	Anthropic	Closed	53.5
05	Claude Opus 4.7	Anthropic	Closed	53.5
06	GPT-5.4	OpenAI	Closed	51.4
07	GPT-5.4 High	OpenAI	Closed	51.4
08	GLM-5.2	Z.ai	Open	51.1
09	Gemini 3.5 Flash	Google	Closed	50.2
10	Gemini 3.1 Pro Preview	Google	Closed	46.5
11	Qwen 3.7 Max	Alibaba	Closed	46.0
12	MiniMax M3	MiniMax	Open	44.4
13	DeepSeek-V4-Pro	DeepSeek	Open	44.3
14	Kimi K2.6	Moonshot AI	Open	44.2
15	Claude Opus 4.6 (Thinking)	Anthropic	Closed	43.7

Score Distribution

Open vs Closed Source

Gap on AA Intelligence Index: +8.8 pts (closed leads)

Top Open-Source Models

1. GLM-5.2: 51.1
2. MiniMax M3: 44.4
3. DeepSeek-V4-Pro: 44.3

Top Closed-Source Models

1. Claude Fable 5: 59.9
2. GPT-5.5: 54.8
3. Grok 4.5: 53.8
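The +8.8 point gap is consistent with the difference between the best closed model and the best open model listed above. The short Python check below reproduces that arithmetic from the scores on this page. That the site computes the gap this way, rather than from group averages, is an assumption.

```python
# Top open and closed scores as listed on this page.
top_open = {"GLM-5.2": 51.1, "MiniMax M3": 44.4, "DeepSeek-V4-Pro": 44.3}
top_closed = {"Claude Fable 5": 59.9, "GPT-5.5": 54.8, "Grok 4.5": 53.8}


def best(scores):
    """Return (model, score) for the highest-scoring entry."""
    return max(scores.items(), key=lambda item: item[1])


open_model, open_score = best(top_open)
closed_model, closed_score = best(top_closed)

# Round to one decimal place to match the precision of the published scores.
gap = round(closed_score - open_score, 1)
print(f"{closed_model} leads {open_model} by {gap:+.1f} pts")
```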

Score vs Parameter Count

81 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average Score	n
Xiaomi	40.9	3
MiniMax	38.7	3
Anthropic	36.1	18
Z.ai	34.5	6
Moonshot AI	33.3	6
xAI	29.3	11
OpenAI	28.1	26
DeepSeek	27.4	6
Alibaba	26.5	16
Google	20.7	21

Most Correlated Benchmarks

Benchmark	Pearson r	n
Terminal Bench Hard	+0.96	126
HLE	+0.92	142
LiveCodeBench	+0.89	86
GPQA	+0.89	142
SciCode	+0.88	141
Terminal Bench	+0.86	17
AA LCR	+0.85	131
SWE-Pro	+0.85	14
IFBench	+0.76	130
MMLU-PRO	+0.76	96
Arena Score	+0.75	119
AIME 2026	+0.74	10
MATH-500	+0.71	53
SWE-Verified	+0.69	19
HMMT 2026	+0.47	9
Pearson r: −1 to +1. Positive means the two benchmarks rank models in similar order; negative means the opposite.
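To make the correlation figures concrete, here is a minimal Python sketch of Pearson r over models that have a score on both benchmarks. The first list reuses four Index scores from this page. The second list is hypothetical, invented only to show the sign behavior. Strictly speaking, Pearson r measures linear association between the scores rather than rank agreement, and values computed from a small n (such as the n = 9 for HMMT 2026) are noisy.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys):
        raise ValueError("both lists must cover the same models")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Index scores for four models, taken from the Top Scorers table.
index_scores = [59.9, 54.8, 53.8, 51.1]

# Hypothetical scores for the same four models on some other benchmark.
other_scores = [40.0, 35.0, 33.0, 30.0]

r_same_order = pearson_r(index_scores, other_scores)
r_reversed = pearson_r(index_scores, list(reversed(other_scores)))
print(f"same order: {r_same_order:+.2f}, reversed order: {r_reversed:+.2f}")
```

The n shown next to each coefficient is the number of models with scores on both benchmarks, which is why it differs from row to row.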

What It Captures Well

A single comparable number across most modern LLMs.
Run consistently by one team using one harness, so cross-model comparisons are fair.
Updated continuously as benchmarks and models change.

Where It Falls Short

Composite scores hide which specific skills a model is good or bad at.
Weighting choices are opinionated and can favor certain skill profiles.
Does not capture agent ability or production reliability.

Frequently Asked Questions

How is this different from our averageScore?

Our average pulls in every benchmark we track, including Arena scores and HF leaderboards. The AA Intelligence Index is Artificial Analysis’s own composite over their evaluation suite. Use it as a second opinion, not a replacement.

Can I see the underlying scores?

Yes. Every benchmark inside the Intelligence Index has its own deep-dive page, so you can drill into MMLU-Pro, GPQA Diamond, LiveCodeBench, and the others independently.

Related Benchmarks

Based on score correlations across our database.

Terminal Bench Hard: Pearson r +0.96
