GPQA (score axis 20–100)
GPT-5.2 · 93.2 (OpenAI, United States)
GPT-5.4 · 92.8 (OpenAI, United States)
Gemini 3 Pro · 91.9 (Google, United States)
Claude Opus 4.6 · 91.3 (Anthropic, United States)
Kimi K2.6 · 90.5 (Moonshot AI, China)
Gemini 3 Flash · 90.4 (Google, United States)
DeepSeek-V4-Pro · 90.1 (DeepSeek, China)
Claude Sonnet 4.6 · 89.9 (Anthropic, United States)
Qwen3.5-397B-A17B · 88.4 (Alibaba, China)
DeepSeek-V4-Flash · 88.1 (DeepSeek, China)
GPT-5.1 · 88.1 (OpenAI, United States)
Qwen3.6-27B · 87.8 (Alibaba, China)
Kimi K2.5 · 87.6 (Moonshot AI, China)
Qwen3.5-122B-A10B · 86.6 (Alibaba, China)
GLM-5.1 · 86.2 (Z.ai, China)
GLM-5 · 86 (Z.ai, China)
Qwen3.6 35B-A3B · 86 (Alibaba, China)
GLM-4.7 · 85.7 (Z.ai, China)
Qwen3.5-27B · 85.5 (Alibaba, China)
minimax-m2.5 · 85.2 (MiniMax, China)
Kimi K2 Thinking · 84.5 (Moonshot AI, China)
Gemma 4 31B IT · 84.3 (Google, United States)
Qwen3.5-35B-A3B · 84.2 (Alibaba, China)
Claude Sonnet 4.5 · 83.4 (Anthropic, United States)
DeepSeek-V3.2 · 82.4 (DeepSeek, China)
Gemma 4 26B-A4B IT · 82.3 (Google, United States)
Qwen3.5-9B · 81.7 (Alibaba, China)
Nvidia Nemotron 3 Super · 79.2 (NVIDIA, United States)
Claude Haiku 4.5 · 75 (Anthropic, United States)
Llama 4 Scout · 73 (Meta, United States)
DeepSeek-R1 · 71.5 (DeepSeek, China)
Gemma 4 E4B IT · 58.6 (Google, United States)
Gemma 4 E2B IT · 43.4 (Google, United States)
Llama 3.1 8B Instruct · 30.4 (Meta, United States)

MMLU-PRO (score axis 55–95)
Gemini 3 Flash · 88.6 (Google, United States)
Qwen3.5-397B-A17B · 87.8 (Alibaba, China)
DeepSeek-V4-Pro · 87.5 (DeepSeek, China)
Kimi K2.5 · 87.1 (Moonshot AI, China)
DeepSeek-V4-Flash · 86.4 (DeepSeek, China)
Qwen3.6-27B · 86.2 (Alibaba, China)
Gemma 4 31B IT · 85.2 (Google, United States)
Qwen3.6 35B-A3B · 85.2 (Alibaba, China)
DeepSeek-V3.2 · 85 (DeepSeek, China)
DeepSeek-R1 · 84 (DeepSeek, China)
Nvidia Nemotron 3 Super · 83.7 (NVIDIA, United States)
Gemma 4 26B-A4B IT · 82.6 (Google, United States)
Qwen3.5-9B · 82.5 (Alibaba, China)
GPT-5.2 · 80 (OpenAI, United States)
Claude Opus 4.6 · 78.5 (Anthropic, United States)
Claude Haiku 4.5 · 72 (Anthropic, United States)
Gemma 4 E4B IT · 69.4 (Google, United States)
DeepSeek-V3 · 64.4 (DeepSeek, China)
Gemma 4 E2B IT · 60 (Google, United States)

GSM8K (score axis 82–96)
DeepSeek-V4-Pro · 92.6 (DeepSeek, China)
DeepSeek-V3 · 89.3 (DeepSeek, China)
Llama 3.1 8B Instruct · 84.5 (Meta, United States)

SWE-Verified (score axis 50–85)
Claude Opus 4.6 · 80.8 (Anthropic, United States)
DeepSeek-V4-Pro · 80.6 (DeepSeek, China)
Kimi K2.6 · 80.2 (Moonshot AI, China)
GPT-5.2 · 80 (OpenAI, United States)
Claude Sonnet 4.6 · 79.6 (Anthropic, United States)
DeepSeek-V4-Flash · 79 (DeepSeek, China)
Claude Sonnet 4.5 · 77.2 (Anthropic, United States)
Qwen3.6-27B · 77.2 (Alibaba, China)
Qwen3.5-397B-A17B · 76.4 (Alibaba, China)
Gemini 3 Pro · 76.2 (Google, United States)
minimax-m2.5 · 75.8 (MiniMax, China)
GPT-5.1 · 74.9 (OpenAI, United States)
GLM-4.7 · 73.8 (Z.ai, China)
Qwen3.6 35B-A3B · 73.4 (Alibaba, China)
GLM-5 · 72.8 (Z.ai, China)
Gemini 3 Flash · 72.5 (Google, United States)
Qwen3.5-27B · 72.4 (Alibaba, China)
Qwen3.5-122B-A10B · 72 (Alibaba, China)
Kimi K2 Thinking · 71.3 (Moonshot AI, China)
Kimi K2.5 · 70.8 (Moonshot AI, China)
DeepSeek-V3.2 · 70 (DeepSeek, China)
Qwen3.5-35B-A3B · 69.2 (Alibaba, China)
Claude Haiku 4.5 · 68 (Anthropic, United States)
Llama 4 Scout · 55 (Meta, United States)
Nvidia Nemotron 3 Super · 53.7 (NVIDIA, United States)

HLE (score axis 0–60)
GPT-5.4 · 52.1 (OpenAI, United States)
Kimi K2.5 · 50.2 (Moonshot AI, China)
DeepSeek-V3.2 · 40.8 (DeepSeek, China)
Claude Opus 4.6 · 40 (Anthropic, United States)
DeepSeek-V4-Pro · 37.7 (DeepSeek, China)
Gemini 3 Pro · 37.5 (Google, United States)
GPT-5.2 · 35.4 (OpenAI, United States)
DeepSeek-V4-Flash · 34.8 (DeepSeek, China)
Kimi K2.6 · 34.7 (Moonshot AI, China)
Gemini 3 Flash · 33.7 (Google, United States)
GLM-5.1 · 31 (Z.ai, China)
Claude Sonnet 4.5 · 30.8 (Anthropic, United States)
GLM-5 · 30.5 (Z.ai, China)
Qwen3.5-397B-A17B · 28.7 (Alibaba, China)
GPT-5.1 · 26 (OpenAI, United States)
Qwen3.5-122B-A10B · 25.3 (Alibaba, China)
GLM-4.7 · 24.8 (Z.ai, China)
Qwen3.5-27B · 24.3 (Alibaba, China)
Qwen3.6-27B · 24 (Alibaba, China)
Kimi K2 Thinking · 23.9 (Moonshot AI, China)
Qwen3.5-35B-A3B · 22.4 (Alibaba, China)
Qwen3.6 35B-A3B · 21.4 (Alibaba, China)
Gemma 4 31B IT · 19.5 (Google, United States)
minimax-m2.5 · 19.4 (MiniMax, China)
Nvidia Nemotron 3 Super · 18.3 (NVIDIA, United States)
Llama 4 Scout · 12 (Meta, United States)
Gemma 4 26B-A4B IT · 8.7 (Google, United States)

AIME 2026 (score axis 30–100)
Claude Opus 4.6 · 100 (Anthropic, United States)
Claude Sonnet 4.5 · 100 (Anthropic, United States)
GPT-5.2 · 100 (OpenAI, United States)
Kimi K2.6 · 96.4 (Moonshot AI, China)
GLM-5 · 95.8 (Z.ai, China)
Kimi K2.5 · 95.8 (Moonshot AI, China)
GLM-5.1 · 95.3 (Z.ai, China)
Gemini 3 Flash · 95 (Google, United States)
Gemini 3 Pro · 95 (Google, United States)
DeepSeek-V3.2 · 94.2 (DeepSeek, China)
Qwen3.6-27B · 94.1 (Alibaba, China)
GPT-5.1 · 94 (OpenAI, United States)
Qwen3.5-35B-A3B · 93.3 (Alibaba, China)
Qwen3.5-397B-A17B · 93.3 (Alibaba, China)
Qwen3.6 35B-A3B · 92.7 (Alibaba, China)
Qwen3.5-9B · 92.5 (Alibaba, China)
Qwen3.5-27B · 90.8 (Alibaba, China)
Nvidia Nemotron 3 Super · 90 (NVIDIA, United States)
Gemma 4 31B IT · 89.2 (Google, United States)
Gemma 4 26B-A4B IT · 88.3 (Google, United States)
Llama 4 Scout · 85 (Meta, United States)
Claude Sonnet 4.6 · 83 (Anthropic, United States)
Gemma 4 E4B IT · 42.5 (Google, United States)
Gemma 4 E2B IT · 37.5 (Google, United States)

Terminal Bench (score axis 20–80)
Claude Opus 4.6 · 74.7 (Anthropic, United States)
DeepSeek-V4-Pro · 67.9 (DeepSeek, China)
Kimi K2.6 · 66.7 (Moonshot AI, China)
GPT-5.2 · 64.9 (OpenAI, United States)
Gemini 3 Flash · 64.3 (Google, United States)
GLM-5.1 · 63.5 (Z.ai, China)
Qwen3.6-27B · 59.3 (Alibaba, China)
DeepSeek-V4-Flash · 56.9 (DeepSeek, China)
Claude Sonnet 4.6 · 53 (Anthropic, United States)
Qwen3.5-397B-A17B · 52.5 (Alibaba, China)
GLM-5 · 52.4 (Z.ai, China)
Qwen3.6 35B-A3B · 51.5 (Alibaba, China)
Claude Sonnet 4.5 · 51 (Anthropic, United States)
Qwen3.5-122B-A10B · 49.4 (Alibaba, China)
Kimi K2.5 · 43.2 (Moonshot AI, China)
Qwen3.5-27B · 41.6 (Alibaba, China)
Qwen3.5-35B-A3B · 40.5 (Alibaba, China)
DeepSeek-V3.2 · 39.6 (DeepSeek, China)
Kimi K2 Thinking · 35.7 (Moonshot AI, China)
Claude Haiku 4.5 · 35.5 (Anthropic, United States)
GLM-4.7 · 33.4 (Z.ai, China)
Nvidia Nemotron 3 Super · 31 (NVIDIA, United States)
Kimi K2 Instruct · 27.8 (Moonshot AI, China)
GLM-4.6 · 24.5 (Z.ai, China)

SWE-Pro (score axis 0–80)
Gemini 3 Flash · 71.2 (Google, United States)
Kimi K2.6 · 58.6 (Moonshot AI, China)
GLM-5.1 · 58.4 (Z.ai, China)
GPT-5.4 · 57.7 (OpenAI, United States)
DeepSeek-V4-Pro · 55.4 (DeepSeek, China)
minimax-m2.5 · 55.4 (MiniMax, China)
Qwen3.6-27B · 53.5 (Alibaba, China)
Kimi K2.5 · 50.7 (Moonshot AI, China)
Qwen3.6 35B-A3B · 49.5 (Alibaba, China)
Claude Opus 4.6 · 45 (Anthropic, United States)
Kimi K2 Instruct · 27.7 (Moonshot AI, China)
Qwen3-235B-A22B · 21.4 (Alibaba, China)
DeepSeek-V3.2 · 15.6 (DeepSeek, China)
Claude Haiku 4.5 · 14 (Anthropic, United States)
Gemma 3 27B IT · 11.4 (Google, United States)
Llama 3.1 405B Instruct · 11.2 (Meta, United States)
GLM-4.6 · 9.7 (Z.ai, China)
Llama 4 Maverick · 5.2 (Meta, United States)
Llama 4 Scout · 5.2 (Meta, United States)

EvasionBench (score axis 65–85)
GLM-4.7 · 82.9 (Z.ai, China)
DeepSeek-V3.2 · 66.9 (DeepSeek, China)
Kimi K2 Instruct 0905 · 66.7 (Moonshot AI, China)

HMMT 2026 (score axis 65–95)
Kimi K2.6 · 92.7 (Moonshot AI, China)
Qwen3.5-397B-A17B · 87.9 (Alibaba, China)
Kimi K2.5 · 87.1 (Moonshot AI, China)
GLM-5 · 86.4 (Z.ai, China)
Nvidia Nemotron 3 Super · 84.8 (NVIDIA, United States)
Qwen3.6-27B · 84.3 (Alibaba, China)
DeepSeek-V3.2 · 84.1 (DeepSeek, China)
Qwen3.6 35B-A3B · 83.6 (Alibaba, China)
GLM-5.1 · 82.6 (Z.ai, China)
Qwen3.5-35B-A3B · 81.8 (Alibaba, China)
Qwen3.5-27B · 81.1 (Alibaba, China)
Qwen3.5-9B · 71.2 (Alibaba, China)

LM Arena (score axis 20–100)
Claude Opus 4.6 (Thinking) · 100 (Anthropic, United States)
Claude Opus 4.6 · 99.4 (Anthropic, United States)
Gemini 3.1 Pro Preview · 98.1 (Google, United States)
Claude Opus 4.7 Thinking · 98.0 (Anthropic, United States)
Gemini 3 Pro · 96.9 (Google, United States)
Claude Opus 4.7 · 96.6 (Anthropic, United States)
Meta Muse Spark · 96.5 (Meta, United States)
Qwen3.5 Max Preview · 95.3 (Alibaba, China)
GPT-5.4 High · 95.3 (OpenAI, United States)
GLM-5.1 · 95.1 (Z.ai, China)
Gemini 3 Flash · 95.0 (Google, United States)
GPT-5.5 · 94.2 (OpenAI, United States)
Gemini 2.5 Pro · 94.0 (Google, United States)
Grok 4.20 Beta 0309 Reasoning · 93.2 (xAI, United States)
Kimi K2.6 · 93.2 (Moonshot AI, China)
Dola Seed 2.0 Pro · 93.1 (ByteDance, China)
GPT-5.4 · 92.8 (OpenAI, United States)
Grok 4.20 Multi-Agent Beta 0309 · 92.8 (xAI, United States)
ERNIE 5.0 0110 · 92.4 (Baidu, China)
Grok 4.20 Beta1 · 92.3 (xAI, United States)
Gemini 3 Flash (Thinking Minimal) · 92.2 (Google, United States)
Claude Sonnet 4.6 · 92.2 (Anthropic, United States)
Claude Opus 4.5 · 92.1 (Anthropic, United States)
Claude Opus 4.5 (Thinking 32K) · 91.9 (Anthropic, United States)
Kimi K2.5 · 91.8 (Moonshot AI, China)
GLM-5 · 91.7 (Z.ai, China)
Qwen3.5-397B-A17B · 91.5 (Alibaba, China)
ERNIE 5.0 Preview 1203 · 91.3 (Baidu, China)
Qwen3.6 Max Preview · 91.2 (Alibaba, China)
Gemma 4 31B IT · 91.2 (Google, United States)
GPT-5.1 High · 91.2 (OpenAI, United States)
GLM-4.6 · 91.0 (Z.ai, China)
Grok 4.1 (Thinking) · 91.0 (xAI, United States)
GPT-5.2 Chat Latest · 90.8 (OpenAI, United States)
Qwen3 Max Preview · 90.8 (Alibaba, China)
Grok 4.1 · 90.5 (xAI, United States)
GLM-4.7 · 90.4 (Z.ai, China)
MiMo v2 Pro · 90.2 (Xiaomi, China)
Gemma 4 26B-A4B IT · 90.1 (Google, United States)
Claude Sonnet 4.5 · 90.1 (Anthropic, United States)
ERNIE 5.0 Preview 1022 · 89.4 (Baidu, China)
Claude Sonnet 4.5 (Thinking 32K) · 89.4 (Anthropic, United States)
GLM-4.5 · 89.3 (Z.ai, China)
ChatGPT-4o Latest (2025-03-26) · 89.2 (OpenAI, United States)
DeepSeek-R1 · 89.1 (DeepSeek, China)
Grok 3 Preview 02-24 · 88.7 (xAI, United States)
DeepSeek-V3.2 · 88.5 (DeepSeek, China)
Gemini 3.1 Flash Lite Preview · 88.4 (Google, United States)
GPT-5.1 · 88.3 (OpenAI, United States)
GPT-5.4 Mini High · 87.9 (OpenAI, United States)
DeepSeek-V3.1 · 87.8 (DeepSeek, China)
Qwen3.5-122B-A10B · 87.8 (Alibaba, China)
Claude Opus 4.1 (Thinking 16K) · 87.7 (Anthropic, United States)
GPT-5.2 High · 87.7 (OpenAI, United States)
Gemini 2.5 Flash · 87.6 (Google, United States)
Claude Opus 4.1 · 87.6 (Anthropic, United States)
GPT-4.5 Preview · 87.5 (OpenAI, United States)
GPT-5.2 · 87.3 (OpenAI, United States)
Kimi K2 Thinking · 87.1 (Moonshot AI, China)
Qwen3 Max (2025-09-23) · 86.9 (Alibaba, China)
Grok 4 (0709) · 86.4 (xAI, United States)
OpenAI o3 · 86.3 (OpenAI, United States)
Grok 4.1 Fast Reasoning · 86.3 (xAI, United States)
Grok 4 Fast Chat · 86.3 (xAI, United States)
Gemini 2.5 Flash Preview 09-2025 · 86.0 (Google, United States)
Hunyuan Vision 1.5 Thinking · 85.9 (Tencent, China)
GPT-5 High · 85.7 (OpenAI, United States)
Qwen3.5-27B · 85.7 (Alibaba, China)
GPT-5 Chat · 85.5 (OpenAI, United States)
Hunyuan T1 · 84.9 (Tencent, China)
Qwen3.5 Flash · 84.8 (Alibaba, China)
Grok 4 Fast Reasoning · 84.5 (xAI, United States)
Qwen3.5-35B-A3B · 84.3 (Alibaba, China)
Qwen3-235B-A22B · 84 (Alibaba, China)
MiniMax M2.7 · 83.8 (MiniMax, China)
Claude Haiku 4.5 · 83.2 (Anthropic, United States)
GPT-5.3 Chat Latest · 82.4 (OpenAI, United States)
GPT-4.1 · 82.2 (OpenAI, United States)
Kimi K2 Instruct 0905 · 81.9 (Moonshot AI, China)
Gemini 2.5 Flash Lite Preview 09-2025 (No Thinking) · 81.8 (Google, United States)
Nvidia Nemotron 3 Super · 81.7 (NVIDIA, United States)
GPT-5.4 Nano High · 81.7 (OpenAI, United States)
Hunyuan TurboS (2025-04-16) · 81.3 (Tencent, China)
Claude Opus 4 (Thinking 16K) · 81.2 (Anthropic, United States)
DeepSeek-V3 · 81.2 (DeepSeek, China)
GPT-5 Mini High · 81.0 (OpenAI, United States)
Kimi K2 Instruct · 80.6 (Moonshot AI, China)
minimax-m2.5 · 80.2 (MiniMax, China)
Gemini 2.5 Flash Lite Preview 06-17 (Thinking) · 80.2 (Google, United States)
Qwen2.5 Max · 79.9 (Alibaba, China)
Grok 3 Mini High · 79.9 (xAI, United States)
OpenAI o1 · 79.8 (OpenAI, United States)
Claude Opus 4 · 79.6 (Anthropic, United States)
Amazon Nova 2 Lite · 79.4 (Amazon, United States)
Grok 3 Mini Beta · 79.3 (xAI, United States)
Gemma 3 27B IT · 78.7 (Google, United States)
Gemini 2.0 Flash · 78.0 (Google, United States)
OpenAI o1 Preview · 77.9 (OpenAI, United States)
OpenAI o4-mini · 77.9 (OpenAI, United States)
Claude Sonnet 4 (Thinking 32K) · 77.2 (Anthropic, United States)
GPT-4.1 Mini · 76.0 (OpenAI, United States)
Qwen3-32B · 76.0 (Alibaba, China)
Claude Sonnet 4 · 75.6 (Anthropic, United States)
OpenAI o3-mini High · 75.4 (OpenAI, United States)
Step 1o Turbo (202506) · 75.2 (StepFun, China)
GLM-4 Plus (0111) · 74.6 (Zhipu, China)
Gemini 2.0 Flash Lite Preview · 74.4 (Google, United States)
Qwen Plus (0125) · 73.9 (Alibaba, China)
Step 2 16K Exp (202412) · 73.1 (StepFun, China)
GPT-5 Nano High · 72.9 (OpenAI, United States)
Hunyuan TurboS (2025-02-26) · 72.9 (Tencent, China)
OpenAI o3-mini · 72.8 (OpenAI, United States)
Qwen3-30B-A3B · 72.5 (Alibaba, China)
OpenAI o1-mini · 72.5 (OpenAI, United States)
Claude 3.7 Sonnet (Thinking 32K) · 72.1 (Anthropic, United States)
Hunyuan Turbo (0110) · 71.7 (Tencent, China)
Grok 2 · 70.6 (xAI, United States)
Yi Lightning · 70.2 (01 AI, China)
GPT-4o · 70.0 (OpenAI, United States)
Gemma 3 4B IT · 68.6 (Google, United States)
Llama 4 Maverick · 68.1 (Meta, United States)
Llama 3.1 405B Instruct · 67.5 (Meta, United States)
Llama 4 Scout · 67.0 (Meta, United States)
Llama 3.3 70B Instruct · 66.2 (Meta, United States)
Llama 3.1 70B Instruct · 64.1 (Meta, United States)
Llama 3 70B Instruct · 58.1 (Meta, United States)
Llama 3.1 8B Instruct · 53.0 (Meta, United States)
Llama 3 8B Instruct · 49.8 (Meta, United States)
Llama 2 70B Chat · 42.3 (Meta, United States)
Llama 2 13B Chat · 37.6 (Meta, United States)
Llama 2 7B Chat · 33.0 (Meta, United States)