
Reasoning variant of Kimi K2, trained to interleave chain-of-thought reasoning with function calls. Sets SOTA on Humanity's Last Exam and BrowseComp. Native INT4 quantization via QAT for 2x speedup. Maintains coherent tool use across 200-300 consecutive invocations.
A workable 1,000B-parameter MoE language model from Moonshot AI. It pulls ahead on graduate-level reasoning (GPQA: 85/100), so reach for it when that's the dimension that matters.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Copy and paste this command to start running the model locally.
`ollama run kimi-k2-thinking:cloud`

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 77.9 GB | Low |
| Q4_K_M (Recommended) | 84.6 GB | Good |
| Q5_K_M | 87.8 GB | Very Good |
| Q6_K | 91.6 GB | Excellent |
| Q8_0 | 99.6 GB | Near Perfect |
| FP16 | 130.0 GB | Full |
See which devices can run this model and at what quality level.
| Device | Tier | Speed | VRAM used (Q4_K_M) |
|---|---|---|---|
| NVIDIA H200 SXM 141GB | S | 45.7 tok/s | 84.6 GB |
| Google TPU v7 (Ironwood) | S | 70.2 tok/s | 84.6 GB |
| NVIDIA B200 GPU | S | 76.1 tok/s | 84.6 GB |
| SuperMicro Super AI Station | S | 67.6 tok/s | 84.6 GB |
| Gigabyte W775-V10-L01 | S | 67.6 tok/s | 84.6 GB |
| Google Cloud TPU v5p | A | 26.3 tok/s | 84.6 GB |
Energy cost on NVIDIA A100 SXM4 80GB (~19 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only) · Kimi K2 Thinking on NVIDIA A100 SXM4 80GB · ~19 tok/s · 400W | $0.687 |
| GPT-5.5 (OpenAI) · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic) · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.1 Flash Lite Preview (Google) · in $0.250 · out $1.50 | $0.625 |
| Grok 4.3 beta (xAI) · in $3.00 · out $15.00 | $6.60 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
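The arithmetic behind both columns is simple enough to reproduce. Below is a minimal sketch of the two calculations; the ~$0.12/kWh electricity rate is an assumption back-solved from the $0.687 figure above, so substitute your own tariff and measured throughput.

```python
# Rough cost-per-million-tokens math behind the table above.
# Assumptions: 400 W draw, ~19 tok/s sustained, ~$0.12/kWh electricity
# (the rate implied by the $0.687 figure); adjust for your own tariff.

WATTS = 400
TOKENS_PER_SECOND = 19
PRICE_PER_KWH = 0.1175  # USD, assumed

def local_energy_cost_per_million_tokens() -> float:
    hours_per_million = 1_000_000 / TOKENS_PER_SECOND / 3600
    kwh = (WATTS / 1000) * hours_per_million
    return kwh * PRICE_PER_KWH

def blended_api_cost(price_in: float, price_out: float,
                     input_share: float = 0.7) -> float:
    """Blend per-1M-token input/output prices at a fixed traffic mix."""
    return input_share * price_in + (1 - input_share) * price_out

print(f"Local energy: ${local_energy_cost_per_million_tokens():.3f} per 1M tokens")
print(f"GPT-5.5 blended: ${blended_api_cost(5.00, 30.00):.2f} per 1M tokens")
```

At a 70/30 input/output mix this reproduces the $12.50 GPT-5.5 figure and the $0.687 local-energy figure shown above.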
Kimi K2 Thinking is Moonshot AI’s flagship reasoning model, designed to compete directly with the highest tier of frontier LLMs. Unlike standard chat models, K2 Thinking utilizes an advanced "Chain-of-Thought" (CoT) process, interleaving internal reasoning steps with external function calls. This architecture allows it to verify its own logic and execute complex tool-use sequences before delivering a final answer. At 1,000B parameters, it is one of the largest MoE (Mixture of Experts) models available for local deployment, specifically optimized for high-stakes reasoning, mathematics, and long-context code generation.
While the total parameter count is massive, the model uses a sparse architecture in which only 32B parameters are active during any single inference pass. This gives Kimi K2 Thinking unusually high MoE efficiency for its weight class: you get the knowledge breadth of a trillion-parameter model with inference latency closer to that of a medium-sized dense model. It currently sets state-of-the-art (SOTA) marks on Humanity's Last Exam and the BrowseComp benchmark, outperforming many proprietary models in complex instruction following and multi-step problem solving.
Kimi K2 Thinking is built on a massive MoE framework totaling 1,000B parameters. The 32B active parameter count is the critical metric when estimating tokens per second. Because only a fraction of the weights are triggered per token, the compute requirement is significantly lower than that of a 1,000B dense model, though the VRAM footprint remains dictated by the total parameter count unless aggressive quantization is used.
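To see why only a fraction of the weights fire per token, here is a minimal, illustrative top-k expert-routing sketch in PyTorch. This is not Moonshot's implementation; the expert count, dimensions, and k value are arbitrary placeholders chosen only to show the mechanism.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE routing: each token is processed by only k of
    the n experts, so per-token compute scales with k, not with n."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.k, dim=-1)   # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

The same principle scales up: a trillion-parameter MoE stores every expert in memory, but each token only pays the compute cost of the few experts the router selects.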
Key technical specifications include:
- Total parameters: 1,000B (Mixture of Experts)
- Active parameters per token: 32B
- Context window: 256K tokens
- Native INT4 quantization via Quantization-Aware Training (QAT)
- Training objective that interleaves chain-of-thought reasoning with function calls
The 256k context window is particularly robust. Unlike models that suffer from "lost in the middle" syndrome, K2 Thinking maintains high retrieval accuracy across the entire buffer. The model was trained specifically to interleave reasoning with tool use, meaning it doesn't just call a function—it reasons about why it is calling it, evaluates the output, and adjusts its strategy in real-time.
Coding is one of the primary use cases for Kimi K2 Thinking. It excels at refactoring large codebases and debugging complex logic errors that require understanding dependencies across multiple files. Because it can handle 200-300 consecutive function calls without losing coherence, it is an ideal engine for autonomous coding agents and local CI/CD analysis tools.
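A rough sketch of how an agent loop can drive that interleaved tool use through an OpenAI-compatible endpoint (Ollama exposes one at `/v1` by default). The `run_tests` tool, its schema, and the stub implementation are hypothetical placeholders for illustration, not part of the model or of Ollama.

```python
import json
from openai import OpenAI  # pip install openai; works with any OpenAI-compatible server

# Assumes a local OpenAI-compatible endpoint; adjust base_url and model tag as needed.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# Hypothetical tool the agent may call repeatedly while working on a codebase.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the failures.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]},
    },
}]

def run_tests(path: str) -> str:            # stub implementation for the sketch
    return json.dumps({"failed": 0, "path": path})

messages = [{"role": "user", "content": "Fix the failing tests in ./src"}]
while True:
    resp = client.chat.completions.create(model="kimi-k2-thinking",
                                          messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:                   # model produced its final answer
        print(msg.content)
        break
    for call in msg.tool_calls:              # execute each requested tool call
        args = json.loads(call.function.arguments)
        result = run_tests(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

The loop simply keeps feeding tool results back until the model stops requesting calls, which is the pattern long 200-300 call chains rely on.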
Other high-performance use cases include:
- Graduate-level reasoning and mathematics (GPQA, Humanity's Last Exam)
- Agentic web research and browsing workflows (BrowseComp)
- Long-context analysis of large documents and codebases (256K window)
- Multi-step problem solving with extended tool-use chains
The model's native INT4 quantization via QAT provides a 2x speedup over standard FP16 inference without the typical accuracy degradation seen in post-training quantization (PTQ). This makes it a prime candidate for users looking to run Kimi K2 Thinking locally on high-end workstation hardware.
The primary challenge for this model is the Kimi K2 Thinking VRAM requirements. Even with its MoE efficiency, the 1,000B total parameter count necessitates significant memory overhead. You cannot run the full-weight model on a single consumer GPU.
To determine the best GPU for Kimi K2 Thinking, you must first decide on your quantization target.
On a high-bandwidth 8x H100 or A100 cluster, you can expect 20-30 tokens per second. On consumer-grade multi-GPU setups (e.g., 4x 4090), expect closer to 2-5 tokens per second due to PCIe bottlenecking and the sheer volume of weights being shifted during MoE routing.
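If you want to check what your own stack actually delivers, a quick way to measure decode throughput is to read the eval counters returned by Ollama's non-streaming generate endpoint. A minimal sketch, assuming Ollama is serving the model on the default port; swap in whatever model tag your installation uses.

```python
import requests

# Measure decode throughput from Ollama's reported eval counters.
resp = requests.post("http://localhost:11434/api/generate",
                     json={"model": "kimi-k2-thinking:cloud",
                           "prompt": "Explain MoE routing in two sentences.",
                           "stream": False},
                     timeout=600).json()

tokens = resp["eval_count"]            # generated tokens
seconds = resp["eval_duration"] / 1e9  # reported in nanoseconds
print(f"{tokens / seconds:.1f} tok/s decode throughput")
```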
When evaluating Kimi K2 Thinking vs DeepSeek-V3 or Llama 3.1 405B, the distinction lies in the reasoning architecture: K2 Thinking is trained to interleave chain-of-thought reasoning with tool calls, whereas DeepSeek-V3 (also a sparse MoE) and the dense Llama 3.1 405B answer through standard instruction following without that native reasoning-and-tool loop.
Choose Kimi K2 Thinking if your priority is state-of-the-art local reasoning at the trillion-parameter scale and you have the VRAM capacity to support a trillion-parameter MoE. If you are limited to 128GB of VRAM or less, you will likely find better performance with smaller, dense models or more aggressively pruned MoEs.
