AA LCR tests reasoning over inputs from 10,000 to 100,000 tokens, well past what shorter benchmarks measure.
Most long-context benchmarks only test whether a model can retrieve a specific fact from a long input ("needle in a haystack"). AA LCR goes further: it tests reasoning that requires synthesizing information spread across the entire long context. Scores at the longest tiers separate models that genuinely use their context window from models that only claim to.
Models receive a long input plus a question that requires reasoning across multiple sections. Scores are reported per length tier so users can see where each model breaks down.
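To make the per-tier reporting concrete, here is a minimal sketch of how results could be grouped by input length and scored. It is not AA LCR's actual scoring code. The tier boundaries in `TIERS`, the `(input_tokens, is_correct)` record format, and the function names are all assumptions made for illustration; the source only states the overall 10,000 to 100,000 token range.

```python
from collections import defaultdict

# Hypothetical tier boundaries in tokens (assumption: the benchmark's real
# tier cut-offs are not given in the text, only the 10,000-100,000 range).
TIERS = [(10_000, 25_000), (25_000, 50_000), (50_000, 100_000)]


def tier_label(n_tokens):
    """Return a label such as '10k-25k' for the tier containing n_tokens.

    Returns None when the input length falls outside every tier.
    """
    for i, (lo, hi) in enumerate(TIERS):
        is_last = i == len(TIERS) - 1
        # Tiers are half-open [lo, hi); the last tier also includes its upper bound.
        if lo <= n_tokens < hi or (is_last and n_tokens == hi):
            return f"{lo // 1000}k-{hi // 1000}k"
    return None


def scores_by_tier(results):
    """Compute percent correct per tier.

    results: iterable of (input_tokens, is_correct) pairs (hypothetical format).
    """
    totals = defaultdict(lambda: [0, 0])  # label -> [correct, attempted]
    for n_tokens, is_correct in results:
        label = tier_label(n_tokens)
        if label is None:
            continue  # skip inputs outside the covered range
        totals[label][0] += int(is_correct)
        totals[label][1] += 1
    return {label: round(100.0 * c / n, 1) for label, (c, n) in totals.items()}
```

Reading the resulting dictionary from the shortest tier to the longest shows where accuracy starts to fall off, which is the point of reporting scores per tier rather than as a single average.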
| # | Model | Lab | Source | Score |
|---|---|---|---|---|
| 01 | GPT-5.1 | OpenAI | Closed | 75.0 |
| 02 | GPT-5.5 | OpenAI | Closed | 74.3 |
| 03 | GPT-5.4 | OpenAI | Closed | 74.0 |
| 04 | MiMo-V2.5-Pro | Xiaomi | Closed | 73.3 |
| 05 | GPT-5.2 | OpenAI | Closed | 72.7 |
| 06 | Gemini 3.1 Pro Preview | Google | Closed | 72.7 |
| 07 | Gemini 3 Pro | Google | Closed | 70.7 |
| 08 | | Anthropic | Closed | 70.3 |
| 09 | Qwen3.6 Max Preview | Alibaba | Closed | 69.7 |
| 10 | Kimi K2.6 | Moonshot AI | Open | 69.7 |
| 11 | Qwen3.6-Plus | Alibaba | Closed | 69.7 |
| 12 | Gemini 3.5 Flash | Google | Closed | 69.3 |
| 13 | Qwen3.6-27B | Alibaba | Open | 68.7 |
| 14 | MiniMax M2.7 | MiniMax | Closed | 68.7 |
| 15 | Grok 4.1 Fast Reasoning | xAI | Closed | 68.0 |