
Reasoning variant of Kimi K2, trained to interleave chain-of-thought reasoning with function calls. Sets SOTA on Humanity's Last Exam and BrowseComp. Native INT4 quantization via QAT for 2x speedup. Maintains coherent tool use across 200-300 consecutive invocations.
Copy and paste this command to start running the model locally:

```
ollama run kimi-k2-thinking:cloud
```

Access model weights, configuration files, and documentation.
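Beyond the interactive CLI, Ollama exposes a local REST API (by default at `http://localhost:11434`), which is how you would script the model from code. A minimal sketch, assuming the model has already been pulled with the command above; the payload shape follows Ollama's chat endpoint:

```python
# Sketch: querying the model through Ollama's local REST API.
# Assumes `ollama run kimi-k2-thinking:cloud` has been executed first
# so the model is available on the default endpoint.
import json
import urllib.request

def build_chat_payload(prompt, model="kimi-k2-thinking:cloud"):
    """Assemble a non-streaming chat request for Ollama's /api/chat."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(prompt):
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(build_chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]
```

With the server running, `ask("Explain MoE routing in two sentences.")` returns the model's reply as a string.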
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 77.9 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 84.6 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 87.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 91.6 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 99.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 130.0 GB | Full | Full 16-bit floating point — maximum quality, largest size |
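The table above doubles as a lookup for sizing a deployment. A small sketch that picks the highest-quality format fitting a given VRAM budget, using the sizes listed above (the list order, smallest to largest, matches the quality ordering):

```python
# Choose the best quantization format for a VRAM budget, using the
# sizes from the table above (ordered lowest to highest quality).
QUANTS = [  # (format, VRAM required in GB)
    ("Q2_K", 77.9),
    ("Q4_K_M", 84.6),
    ("Q5_K_M", 87.8),
    ("Q6_K", 91.6),
    ("Q8_0", 99.6),
    ("FP16", 130.0),
]

def best_format(vram_gb):
    """Return the highest-quality format that fits, or None if nothing does."""
    fitting = [fmt for fmt, size in QUANTS if size <= vram_gb]
    return fitting[-1] if fitting else None
```

For example, `best_format(96)` returns `"Q6_K"`, while `best_format(70)` returns `None`: no listed build fits in 70 GB.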
See which devices can run this model and at what quality level.

| Device | Grade | Speed | Model Size |
|---|---|---|---|
| NVIDIA H200 SXM 141GB | SS | 45.7 tok/s | 84.6 GB |
| — | SS | 50.4 tok/s | 84.6 GB |
| NVIDIA B200 GPU | SS | 76.1 tok/s | 84.6 GB |
| — | SS | 57.1 tok/s | 84.6 GB |
| — | SS | 35.2 tok/s | 84.6 GB |
| — | SS | 76.1 tok/s | 84.6 GB |
| Google Cloud TPU v5p | AA | 26.3 tok/s | 84.6 GB |
| — | AA | 23.3 tok/s | 84.6 GB |
| — | BB | 7.6 tok/s | 84.6 GB |
| — | BB | 7.6 tok/s | 84.6 GB |
| — | BB | 5.8 tok/s | 84.6 GB |
| — | BB | 5.8 tok/s | 84.6 GB |
| — | BB | 5.8 tok/s | 84.6 GB |
| — | BB | 5.2 tok/s | 84.6 GB |
| — | BB | 5.2 tok/s | 84.6 GB |
| — | BB | 5.2 tok/s | 84.6 GB |
| — | BB | 5.2 tok/s | 84.6 GB |
| — | BB | 2.6 tok/s | 84.6 GB |
| — | BB | 7.8 tok/s | 84.6 GB |
| — | BB | 7.8 tok/s | 84.6 GB |
| NVIDIA H100 SXM5 80GB | BB | 31.9 tok/s | 84.6 GB |
| — | BB | 3.8 tok/s | 84.6 GB |
| NVIDIA A100 SXM4 80GB | CC | 19.4 tok/s | 84.6 GB |
| — | FF | 2.7 tok/s | 84.6 GB |
| — | FF | 4.1 tok/s | 84.6 GB |
Kimi K2 Thinking is Moonshot AI’s flagship reasoning model, designed to compete directly with the highest tier of frontier LLMs. Unlike standard chat models, K2 Thinking utilizes an advanced "Chain-of-Thought" (CoT) process, interleaving internal reasoning steps with external function calls. This architecture allows it to verify its own logic and execute complex tool-use sequences before delivering a final answer. At 1,000B parameters, it is one of the largest MoE (Mixture of Experts) models available for local deployment, specifically optimized for high-stakes reasoning, mathematics, and long-context code generation.
While the total parameter count is massive, the model uses a sparse architecture in which only 32B parameters are active during any single inference pass. This makes Kimi K2 Thinking unusually efficient for its weight class: you get the knowledge breadth of a trillion-parameter model with inference latency closer to that of a medium-sized dense model. It currently sets state-of-the-art (SOTA) marks on Humanity's Last Exam and the BrowseComp benchmark, outperforming many proprietary models in complex instruction following and multi-step problem solving.
Kimi K2 Thinking is built on a massive MoE framework totaling 1,000B parameters. The 32B active parameter count is the critical metric when estimating tokens per second: because only a fraction of the weights is triggered per token, the compute requirement is significantly lower than that of a 1,000B dense model, though the VRAM footprint remains dictated by the total parameter count unless aggressive quantization is used.
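The arithmetic behind that split is simple enough to write down. A sketch (ignoring attention and KV-cache costs): per-token compute scales with the active parameters, while the memory footprint scales with the total parameters.

```python
# Rough MoE efficiency arithmetic: compute per token tracks ACTIVE
# parameters, memory footprint tracks TOTAL parameters.
TOTAL_PARAMS = 1_000e9   # 1,000B total (all experts resident in memory)
ACTIVE_PARAMS = 32e9     # 32B routed per token

# Fraction of a dense 1,000B model's per-token FLOPs actually spent:
compute_fraction = ACTIVE_PARAMS / TOTAL_PARAMS  # 0.032, i.e. ~3.2%
```

So each generated token costs roughly 3.2% of the FLOPs of an equally sized dense model, yet every expert's weights must still be held in memory, which is why the VRAM requirement stays high.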
Key technical specifications include:

- Total parameters: 1,000B (Mixture of Experts)
- Active parameters per token: 32B
- Context window: 256k tokens
- Native INT4 quantization via quantization-aware training (QAT)
The 256k context window is particularly robust. Unlike models that suffer from "lost in the middle" syndrome, K2 Thinking maintains high retrieval accuracy across the entire buffer. The model was trained specifically to interleave reasoning with tool use, meaning it doesn't just call a function—it reasons about why it is calling it, evaluates the output, and adjusts its strategy in real-time.
Kimi K2 Thinking for coding is one of the primary use cases for this model. It excels at refactoring large codebases and debugging complex logic errors that require understanding dependencies across multiple files. Because it can handle 200-300 consecutive function calls without losing coherence, it is an ideal engine for autonomous coding agents and local CI/CD analysis tools.
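The long tool-use chains described above follow a standard agent-loop pattern: the model emits either a tool call or a final answer, the harness executes the tool, and the result is fed back. A minimal sketch with a scripted stub standing in for the model (the tool names and message shapes here are illustrative, not the model's actual API):

```python
# Minimal agent loop sketch: interleaved reasoning and tool calls.
# `stub_model` is a hard-coded stand-in; a real setup would call a
# local inference endpoint instead.

TOOLS = {"list_files": lambda: ["main.py", "utils.py"]}  # hypothetical tool

def stub_model(history):
    """Pretend model: request a tool once, then answer from its result."""
    if not any(m["role"] == "tool" for m in history):
        return {"thought": "Need the file list first.",
                "tool": "list_files", "args": {}}
    return {"thought": "Have what I need.", "answer": "2 files found."}

def run_agent(model, max_steps=10):
    history = []
    for _ in range(max_steps):
        step = model(history)
        if "answer" in step:          # model is done reasoning
            return step["answer"]
        result = TOOLS[step["tool"]](**step["args"])  # execute the call
        history.append({"role": "tool", "name": step["tool"],
                        "content": result})
    raise RuntimeError("step budget exhausted")
```

A production harness would raise `max_steps` into the hundreds; the claim above is that K2 Thinking stays coherent across 200-300 such iterations.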
Other high-performance use cases include:
The model's native INT4 quantization via QAT provides a 2x speedup over standard FP16 inference without the typical accuracy degradation seen in post-training quantization (PTQ). This makes it a prime candidate for users looking to run Kimi K2 Thinking locally on high-end workstation hardware.
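The speedup is easiest to see through memory traffic: token generation is usually memory-bandwidth bound, so throughput roughly tracks how many bytes of weights must be read per token. A back-of-the-envelope sketch using the 32B active parameter count (real gains, like the cited 2x, come in below the raw 4x byte reduction because of activations, KV cache, and kernel overhead):

```python
# Bytes of weights touched per generated token, by precision.
# Decoding is typically bandwidth-bound, so fewer bytes ~ faster tokens.
ACTIVE_PARAMS = 32e9  # 32B active parameters per token

def weight_gb_per_token(bits_per_weight):
    return ACTIVE_PARAMS * bits_per_weight / 8 / 1e9

fp16_gb = weight_gb_per_token(16)  # 64.0 GB read per token at FP16
int4_gb = weight_gb_per_token(4)   # 16.0 GB read per token at INT4
```

INT4 moves a quarter of the data per token; QAT is what keeps accuracy intact at that precision, where PTQ typically degrades.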
The primary challenge for this model is the Kimi K2 Thinking VRAM requirements. Even with its MoE efficiency, the 1,000B total parameter count necessitates significant memory overhead. You cannot run the full-weight model on a single consumer GPU.
To determine the best GPU for Kimi K2 Thinking, you must first decide on your quantization target.
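Once you have a quantization target, sizing a multi-GPU setup is straightforward division. A sketch using the 84.6 GB Q4_K_M build from the table above; the ~10% headroom for KV cache and activations is an assumed rule of thumb, not a measured figure:

```python
import math

def gpus_needed(model_gb, per_gpu_gb, headroom=0.10):
    """Estimate GPU count: quantized weights plus an assumed ~10%
    headroom for KV cache and activations (rule of thumb)."""
    return math.ceil(model_gb * (1 + headroom) / per_gpu_gb)

# Q4_K_M build (84.6 GB) across common cards:
h100s = gpus_needed(84.6, 80)   # H100 80GB -> 2
rtx4090s = gpus_needed(84.6, 24)  # RTX 4090 24GB -> 4
```

Long-context workloads inflate the KV cache well past 10%, so treat these counts as a floor, not a guarantee.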
On a high-bandwidth 8x H100 or A100 cluster, you can expect 20-30 tokens per second. On consumer-grade multi-GPU setups (e.g., 4x 4090), expect closer to 2-5 tokens per second due to PCIe bottlenecking and the sheer volume of weights being shifted during MoE routing.
When evaluating Kimi K2 Thinking vs DeepSeek-V3 or Llama 3.1 405B, the distinction lies in the reasoning architecture.
Choose Kimi K2 Thinking if your priority is state-of-the-art reasoning from a trillion-parameter local model and you have the VRAM capacity to support a MoE of this scale. If you are limited to 128 GB of VRAM or less, you will likely find better performance with smaller, dense models or more aggressively pruned MoEs.