
Moonshot AI's 1 trillion parameter MoE model with 32B active parameters. Trained on 15.5T tokens using the Muon optimizer. Optimized for agentic capabilities including tool use and autonomous problem-solving. Achieves 65.8% on SWE-bench Verified.
Copy and paste this command to start running the model locally:

```shell
ollama run kimi-k2:1t-cloud
```

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 45.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 51.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 55.0 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 58.9 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 66.9 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 97.3 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| Google Cloud TPU v5p | Google | SS | 42.9 tok/s | 51.8 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 52.0 tok/s | 51.8 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 74.6 tok/s | 51.8 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 124.3 tok/s | 51.8 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 31.7 tok/s | 51.8 GB |
Kimi K2 Instruct is a massive-scale Mixture of Experts (MoE) model developed by Moonshot AI. With a total parameter count of 1 trillion (1000B), it represents a significant push in the open-weights landscape toward frontier-level performance. Despite its total size, it utilizes a sparse architecture where only 32 billion parameters are active during any single forward pass. This design allows Kimi K2 to offer the reasoning depth of a trillion-parameter model while maintaining the inference latency of a much smaller dense model.
Trained on a massive 15.5 trillion token dataset using the innovative Muon optimizer, Kimi K2 Instruct is specifically tuned for agentic workflows. It competes directly with other high-parameter MoE models like DeepSeek-V3 and Grok-1. For developers looking to run Kimi K2 Instruct locally, the model offers a compelling balance: state-of-the-art performance on coding and reasoning benchmarks with an efficiency profile that makes it accessible on high-end workstation hardware.
The defining characteristic of Kimi K2 Instruct is its MoE efficiency. By activating only 32B of its 1000B parameters per token, the model significantly reduces the compute requirements for inference compared to a dense 1000B model. However, the primary bottleneck for local deployment is not compute, but memory. Even with sparse activation, the entire 1000B parameter set must reside in VRAM (or system RAM) to avoid massive latency penalties during expert switching.
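The compute-vs-memory distinction above can be made concrete with a back-of-envelope sketch. The ~2 FLOPs per active parameter per token rule of thumb is a standard estimate for decoder inference, not a published figure for this architecture:

```python
# Illustrative sketch: why MoE compute scales with ACTIVE parameters
# while memory scales with TOTAL parameters.
TOTAL_PARAMS_B = 1000   # all experts must stay resident in VRAM/RAM
ACTIVE_PARAMS_B = 32    # experts actually used per forward pass

# Rough rule of thumb: ~2 FLOPs per active parameter per generated token.
flops_per_token = 2 * ACTIVE_PARAMS_B * 1e9
dense_equivalent = 2 * TOTAL_PARAMS_B * 1e9

print(f"MoE compute per token:  {flops_per_token / 1e9:.0f} GFLOPs")
print(f"Dense 1000B per token:  {dense_equivalent / 1e9:.0f} GFLOPs")
print(f"Compute reduction:      {dense_equivalent / flops_per_token:.1f}x")
```

The roughly 31x compute reduction is why the model can decode at the latency of a ~32B dense model, while the memory footprint remains that of the full trillion parameters.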
The 128k context length is robust enough for large-scale document analysis and complex repository-level coding tasks. Because it was trained with the Muon optimizer, the model exhibits better convergence and stability in instruction-following compared to many first-generation MoE models.
Kimi K2 Instruct is positioned as an "agentic" model, meaning it excels at multi-step problem solving and tool interaction. Its reasoning benchmark scores are particularly high, notably achieving 65.8% on SWE-bench Verified, placing it among the top-performing models for autonomous software engineering.
The primary challenge with this model is its hardware footprint. With 1000B parameters, the VRAM requirement is substantial even before accounting for the KV cache.
To calculate the VRAM needed, you must look at the quantization level. A 1000B model in 16-bit (FP16) would require 2TB of VRAM, which is impossible on consumer hardware. Quantization is mandatory for local use.
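The arithmetic behind these figures can be sketched as a small estimator. The ~10% overhead factor for KV cache and activation buffers is an assumption, and the 4.5 bits/weight value is a rough effective size for Q4_K_M-style quantization:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead: float = 1.1) -> float:
    """Rough VRAM estimate: weights at the given bit-width, plus ~10%
    overhead for KV cache and activation buffers (assumed, not measured)."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# 1000B parameters at FP16 (16 bits) vs a ~4.5-bit Q4-class quantization:
print(estimate_vram_gb(1000, 16))   # ~2 TB class
print(estimate_vram_gb(1000, 4.5))  # ~600 GB class
```

This matches the ballpark figures quoted elsewhere in this article: roughly 2TB at FP16 and roughly 600GB at 4-bit for the full trillion-parameter weight set.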
Running a local AI model with 1000B parameters in 2025 requires either a distributed multi-node setup or a high-memory multi-GPU workstation.
For most practitioners, Q4_K_M is the gold standard for maintaining intelligence. However, given the 1000B scale, IQ4_XS or even Q3_K_L are often more practical. The "intelligence collapse" typically seen in smaller models at 3-bit is less pronounced in 1000B models, meaning a 3-bit Kimi K2 will still outperform a 4-bit 70B model.
The Kimi K2 Instruct tokens per second (t/s) will vary based on your interconnect (NVLink vs. PCIe Gen4/5). On a well-optimized 8x GPU setup, expect 5-12 t/s.
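To put that throughput range in practical terms, here is a trivial wall-clock estimate for a long response (the 1000-token response length is just an example):

```python
def generation_seconds(tokens: int, toks_per_s: float) -> float:
    """Wall-clock time to decode `tokens` at a steady rate, ignoring
    prompt-processing (prefill) time."""
    return tokens / toks_per_s

# At the 5-12 t/s range quoted for a well-optimized 8x GPU setup:
print(f"{generation_seconds(1000, 5):.0f}s")   # 200s at 5 t/s
print(f"{generation_seconds(1000, 12):.0f}s")  # 83s at 12 t/s
```

In other words, a long agentic response can take one to three minutes, so interconnect bandwidth is worth optimizing before adding more GPUs.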
Ollama remains the most accessible entry point for local testing. Once you have the necessary memory pool, you can run:

```shell
ollama run kimi-k2
```

(Note: ensure your backend is configured for multi-GPU memory pooling.)
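One way to configure that pooling is via the Ollama server's environment variables. `OLLAMA_SCHED_SPREAD` asks the scheduler to spread a model across all visible GPUs rather than packing it onto as few as possible; verify the variable against your installed Ollama version, as behavior varies across releases:

```shell
# Expose all eight GPUs and ask Ollama to spread the model across them
# (OLLAMA_SCHED_SPREAD is an Ollama server env var; check your version's docs).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export OLLAMA_SCHED_SPREAD=1
ollama serve
```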
When evaluating Kimi K2 Instruct, the most relevant comparisons are DeepSeek-V3, another large MoE model, and dense frontier models such as Llama 3.1 405B.
| Feature | Kimi K2 Instruct | DeepSeek-V3 | Llama 3.1 405B |
| :--- | :--- | :--- | :--- |
| Architecture | MoE (1000B) | MoE (671B) | Dense (405B) |
| Active Params | 32B | 37B | 405B |
| Context | 128k | 128k | 128k |
| Best For | Agents & Reasoning | Coding & Logic | General Purpose |
| Min VRAM (Q4) | ~600GB | ~400GB | ~230GB |
For users with massive VRAM availability, Kimi K2 Instruct is the superior choice for autonomous agent tasks. If VRAM is limited to 256GB-384GB, DeepSeek-V3 or a heavily quantized Llama 405B are more realistic targets.