Benchmarks · 2024

AA LCR: Artificial Analysis Long Context Reasoning

Name: AA LCR: Artificial Analysis Long Context Reasoning
Creator: Artificial Analysis
Published: 2024
Keywords: AA LCR, AI benchmark, text model evaluation, Artificial Analysis

Tests reasoning over inputs from 10,000 to 100,000 tokens, well past what shorter benchmarks measure.

Open Dataset

Models Tested: 131
Top Score: 75.6
Published: 2024
Source: Artificial Analysis

How It Works

Most long-context benchmarks only test whether a model can retrieve a specific fact from a long input ("needle in a haystack"). AA LCR goes further: it tests reasoning that requires synthesizing information spread across the entire long context. Scores at the longest tiers separate models that genuinely use their context window from models that only claim to.

Models receive a long input plus a question that requires reasoning across multiple sections. Scores are reported per length tier so users can see where each model breaks down.
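Per-tier reporting can be sketched as follows. This is a minimal illustration, not the Artificial Analysis harness: the tier boundaries, the `tier_for` and `accuracy_by_tier` helpers, and the example results are all assumptions made up for this sketch. The source only states that tiers fall between 10K and 100K input tokens.

```python
from collections import defaultdict

# Hypothetical tier boundaries in input tokens. The source only says tiers
# span 10K to 100K tokens; the actual cut points are not given.
TIER_UPPER_BOUNDS = [25_000, 50_000, 100_000]


def tier_for(input_tokens: int) -> str:
    """Map an input length to a label such as '<=25K' (assumed tiers)."""
    for bound in TIER_UPPER_BOUNDS:
        if input_tokens <= bound:
            return f"<={bound // 1000}K"
    raise ValueError(f"{input_tokens} tokens is above the largest tier")


def accuracy_by_tier(results):
    """Return percent correct per tier from (input_tokens, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for input_tokens, is_correct in results:
        tier = tier_for(input_tokens)
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: 100.0 * correct[tier] / totals[tier] for tier in totals}


# Made-up graded results for one model, for illustration only.
example_results = [
    (12_000, True), (20_000, True),
    (40_000, True), (45_000, False),
    (80_000, True), (90_000, False), (95_000, False), (99_000, False),
]
tier_scores = accuracy_by_tier(example_results)
for tier, score in tier_scores.items():
    print(f"{tier}: {score:.1f}")
```

A single blended score would hide the drop at the longest tier in this made-up data, which is the reason for reporting tiers separately.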

Dataset size: Long-context reasoning tasks at multiple length tiers between 10K and 100K input tokens.
Mean score: 49.6
Median score: 59.0
Open / Closed: 57 / 74

Top Scorers

#	Model	Lab	Source	Score
01	GPT-5 High	OpenAI	Closed	75.6
02	GPT-5.1	OpenAI	Closed	75.0
03	GPT-5.1 High	OpenAI	Closed	75.0
04	GPT-5.5	OpenAI	Closed	74.3
05	MiniMax M3	MiniMax	Open	74.0
06	Claude Opus 4.5 (Thinking 32K)	Anthropic	Closed	74.0
07	GPT-5.4	OpenAI	Closed	74.0
08	GPT-5.4 High	OpenAI	Closed	74.0
09	MiMo-V2.5-Pro	Xiaomi	Closed	73.3
10	GPT-5.2	OpenAI	Closed	72.7
11	GPT-5.2 High	OpenAI	Closed	72.7
12	Gemini 3.1 Pro Preview	Google	Closed	72.7
13	GLM-5.2	Z.ai	Open	71.3
14	Gemini 3 Pro	Google	Closed	70.7
15	Claude Opus 4.6 (Thinking)	Anthropic	Closed	70.7

Open vs Closed Source

Gap on AA LCR: +1.6 pts, closed leads (top closed model 75.6 vs. top open model 74.0)

Top Open-Source Models

#	Model	Score
1	MiniMax M3	74.0
2	GLM-5.2	71.3
3	Kimi K2.6	69.7

Top Closed-Source Models

#	Model	Score
1	GPT-5 High	75.6
2	GPT-5.1	75.0
3	GPT-5.1 High	75.0
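The 1.6-point gap is consistent with subtracting the best open-source score from the best closed-source score in the Top Scorers table. The page does not state the formula, so the sketch below is one reading of it; the `best_score` helper is hypothetical, and the rows are a subset copied from the table.

```python
# (model, source, score) rows copied from the Top Scorers table.
top_scorers = [
    ("GPT-5 High", "Closed", 75.6),
    ("GPT-5.1", "Closed", 75.0),
    ("MiniMax M3", "Open", 74.0),
    ("GLM-5.2", "Open", 71.3),
]


def best_score(rows, source):
    """Highest score among rows whose source label matches."""
    return max(score for _, src, score in rows if src == source)


# Round to one decimal to avoid floating-point noise in the subtraction.
gap = round(best_score(top_scorers, "Closed") - best_score(top_scorers, "Open"), 1)
print(f"Closed leads by {gap} pts")
```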

Score vs Parameter Count

72 models with undisclosed parameter counts are not shown. Most closed-source labs do not publish model size.

Average Score by Lab

Lab	Average score	n
MiniMax	69.6	3
Xiaomi	65.6	3
Moonshot AI	61.8	6
OpenAI	60.3	21
Anthropic	59.0	17
xAI	58.1	10
Z.ai	55.9	6
Alibaba	49.9	15
DeepSeek	49.5	6
NVIDIA	47.8	2
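The lab averages rest on very different sample sizes (n = 2 to n = 21), so they should not be combined as if each lab counted equally. As a rough sketch, the snippet below pools the ten listed labs with each average weighted by its n. It covers only these ten labs, so it is not expected to reproduce the overall mean of 49.6, which includes labs not listed here.

```python
# (lab, average score, number of models n), copied from the list above.
lab_averages = [
    ("MiniMax", 69.6, 3),
    ("Xiaomi", 65.6, 3),
    ("Moonshot AI", 61.8, 6),
    ("OpenAI", 60.3, 21),
    ("Anthropic", 59.0, 17),
    ("xAI", 58.1, 10),
    ("Z.ai", 55.9, 6),
    ("Alibaba", 49.9, 15),
    ("DeepSeek", 49.5, 6),
    ("NVIDIA", 47.8, 2),
]

total_models = sum(n for _, _, n in lab_averages)

# Weight each lab's average by how many of its models were scored.
pooled_mean = sum(avg * n for _, avg, n in lab_averages) / total_models

# For comparison: treat every lab as one data point regardless of n.
unweighted_mean = sum(avg for _, avg, _ in lab_averages) / len(lab_averages)

print(f"models: {total_models}")
print(f"n-weighted mean: {pooled_mean:.2f}")
print(f"unweighted mean of lab averages: {unweighted_mean:.2f}")
```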

Most Correlated Benchmarks

Benchmark	Pearson r	n
GPQA	+0.90	131
LiveCodeBench	+0.88	75
AA Intelligence Index	+0.85	131
SWE-Pro	+0.84	14
SciCode	+0.81	131
Terminal Bench Hard	+0.80	125
MMLU-PRO	+0.76	86
IFBench	+0.76	130
HLE	+0.74	131
Arena Score	+0.71	105
MATH-500	+0.68	41
Terminal Bench	+0.60	17
HMMT 2026	+0.38	9
SWE-Verified	+0.37	19
AIME 2026	+0.31	10
Pearson r: −1 to +1. Positive means the two benchmarks rank models in similar order; negative means the opposite.
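For readers who want to reproduce this kind of figure, here is a minimal sketch of the Pearson coefficient between two score lists. The `pearson_r` function is written from the textbook definition and is not the site's code; the paired scores are made up for illustration. Pearson r measures linear association between the scores themselves, so "rank in similar order" is an approximation.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical scores for the same five models on two benchmarks.
lcr_scores = [75.6, 74.0, 71.3, 60.0, 45.0]
other_scores = [85.0, 80.0, 79.0, 65.0, 50.0]
r = pearson_r(lcr_scores, other_scores)
print(f"r = {r:+.2f}")
```

Only models scored on both benchmarks can enter the calculation, and rows with small n are much noisier than those with n above 100.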

What It Captures Well

Stresses reasoning, not just retrieval.
Length-tier breakdowns show where models start to fail.
Comparable across labs using the same harness.

Where It Falls Short

Closed methodology compared to academic benchmarks.
Sensitive to chunking and prompt structure.
Different from RAG benchmarks, which test retrieval pipelines.

Frequently Asked Questions

Is this the same as Needle-in-a-Haystack?

No. Needle-in-a-Haystack only tests recall of a single planted fact. AA LCR requires reasoning over information spread throughout the long context, which is much harder.

Related Benchmarks

Based on score correlations across our database.

GPQA: Pearson r +0.90


