
Updated version of Kimi K2 released September 2025. Significant improvements in agentic coding intelligence and frontend coding. Context window increased from 128K to 256K tokens. Same 1T MoE architecture with 32B active parameters.
Copy and paste this command to start running the model locally.
`ollama run kimi-k2:1t-cloud`

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 77.9 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 84.6 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 87.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 91.6 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 99.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 130.0 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Grade | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 GPU | SS | 76.1 tok/s | 84.6 GB |
| NVIDIA H200 SXM 141GB | SS | 45.7 tok/s | 84.6 GB |
| NVIDIA H100 SXM5 80GB | BB | 31.9 tok/s | 84.6 GB |
| Google Cloud TPU v5p | AA | 26.3 tok/s | 84.6 GB |
| NVIDIA A100 SXM4 80GB | CC | 19.4 tok/s | 84.6 GB |
The Kimi K2 Instruct 0905 is a massive-scale Mixture of Experts (MoE) model developed by Moonshot AI. Released in September 2025 as a significant update to the original K2, this iteration pushes the model's total parameter count to 1000B (1 Trillion). Despite its enormous footprint, the model utilizes a sparse architecture that only activates 32B parameters during inference, positioning it as a direct competitor to other ultra-large-scale MoE models like DeepSeek-V3 or Grok-1.
For developers looking to run Kimi K2 Instruct 0905 locally, the appeal lies in its "agentic" intelligence. Moonshot AI has specifically tuned this version for high-end reasoning, complex function-calling, and advanced frontend coding. With a doubled context window of 256,000 tokens, it is built for long-form document analysis and multi-file codebase manipulation that smaller models struggle to maintain.
The Kimi K2 Instruct 0905 architecture is a 1000B-parameter Mixture of Experts (MoE) design. In a dense model, every parameter participates in computing every generated token; in this MoE setup, only a small subset of "experts" is triggered for any given input.
MoE efficiency is the model's defining technical trait for local practitioners. Because only 32B parameters are active per token, the compute requirement (FLOPs) is far lower than for a dense 1000B model, which yields much higher tokens-per-second than one might expect from a "1T model." Memory, however, remains the primary bottleneck: all 1000B parameters must reside in VRAM or system RAM to avoid the massive latency of swapping weights from disk.
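The routing idea can be sketched with a toy example. This is not Moonshot AI's actual implementation; the expert count, top-k value, and dimensions below are small, hypothetical stand-ins chosen only to show why per-token compute scales with active parameters rather than total parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny config (K2's real expert count and dims are far larger)
n_experts, top_k, d_model = 8, 2, 16
tokens = rng.standard_normal((4, d_model))            # a batch of 4 tokens
router_w = rng.standard_normal((d_model, n_experts))  # router projection
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ router_w                    # router scores each expert
    picked = np.argsort(logits)[:, -top_k:]  # keep only the top-k per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over the *selected* experts' logits only
        sel = logits[t, picked[t]]
        w = np.exp(sel - sel.max()); w /= w.sum()
        for weight, e in zip(w, picked[t]):
            out[t] += weight * (x[t] @ experts[e])   # only chosen experts run
    return out, picked

out, picked = moe_forward(tokens)
# Only top_k / n_experts of the expert weights are exercised per token,
# which is why a sparse "1T" model needs far fewer FLOPs than a dense one.
print(f"active fraction per token: {top_k / n_experts:.2%}")
```

Note that while FLOPs scale with the active fraction, every expert must still be resident in memory, since different tokens pick different experts.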
This model is engineered for complex, multi-step tasks rather than simple chat. Its "Instruct" tuning focuses on high-fidelity adherence to system prompts and structured outputs.
Kimi K2 Instruct 0905 for coding excels specifically in frontend frameworks and agentic workflows. It can reason through UI/UX logic, generate complex React or Vue components, and debug across multiple files. Because of the 256K context, you can feed it entire library documentations or large portions of a repository to ensure code consistency.
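In practice, feeding repository files into the 256K window might look like the sketch below. It assumes a local Ollama server is already running the model from the command above; the `/api/generate` endpoint and `num_ctx` option are standard Ollama API features, but verify against your version's documentation. The helper names and the sample file are illustrative only, and nothing is sent over the network until `ask()` is called.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_payload(prompt: str, context_files: dict) -> dict:
    # Concatenate whole files into the prompt -- the 256K window is the
    # point: smaller-context models would have to truncate this.
    blob = "\n\n".join(f"// {path}\n{src}"
                       for path, src in context_files.items())
    return {
        "model": "kimi-k2:1t-cloud",
        "prompt": f"{blob}\n\n{prompt}",
        "stream": False,
        "options": {"num_ctx": 262144},  # request the full 256K context
    }

def ask(prompt: str, context_files: dict) -> str:
    # Sends the request to the local server (requires Ollama running).
    data = json.dumps(build_payload(prompt, context_files)).encode()
    req = request.Request(OLLAMA_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_payload("Refactor Button to TypeScript.",
                        {"src/Button.jsx": "export const Button = () => null;"})
print(payload["options"]["num_ctx"])
```

For real repositories you would walk the source tree and filter to relevant files, since even 256K tokens fills up quickly with a large codebase.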
The updated 0905 version shows a marked improvement on reasoning benchmarks, particularly in chain-of-thought tasks and mathematical proofs.
The primary challenge of running a 1000B-parameter model locally is the VRAM footprint: even though the active parameter count is low, the total weights must all be stored.
To run this model, you must account for the total 1000B parameters. At 4-bit quantization (Q4_K_M), the model requires approximately 550GB to 600GB of VRAM/RAM.
For most practitioners, the best quantization is Q4_K_M. If VRAM is still the limiting factor, you can drop to IQ2_M or Q3_K_L, though this will noticeably degrade reasoning capability.
| Quantization | Est. VRAM Required | Recommended Hardware |
| :--- | :--- | :--- |
| Q2_K (2-bit) | ~320 GB | Mac Studio (Full Spec) / 4x A6000 |
| Q4_K_M (4-bit) | ~580 GB | 8x A100 80GB Node |
| Q8_0 (8-bit) | ~1.1 TB | Multi-node Cluster |
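The figures in the table above can be sanity-checked with simple arithmetic: weight storage is roughly total parameters times bits per weight, divided by eight. The effective bits-per-weight values below are approximations (K-quants mix precisions across tensors), so treat the results as estimates, not exact requirements.

```python
# Back-of-envelope check: weights-only memory = params * bits / 8.
TOTAL_PARAMS = 1_000e9  # 1000B total -- all experts must be resident

EFFECTIVE_BITS = {      # rough effective bits/weight for common GGUF formats
    "Q2_K": 2.6,
    "Q4_K_M": 4.6,
    "Q8_0": 8.5,        # 8-bit values plus a per-block scale
    "FP16": 16.0,
}

for fmt, bits in EFFECTIVE_BITS.items():
    gb = TOTAL_PARAMS * bits / 8 / 1e9
    print(f"{fmt:>7}: ~{gb:,.0f} GB for weights alone (plus KV cache)")
```

These estimates land close to the table's ~320 GB (Q2_K), ~580 GB (Q4_K_M), and ~1.1 TB (Q8_0); the long-context KV cache adds further overhead on top.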
If you do not have a server farm, the only way to run Kimi K2 Instruct 0905 locally is through GGUF offloading via Ollama or llama.cpp, utilizing system RAM (DDR4/DDR5). Be warned: while this allows the model to load, the generation speed will likely be sub-1 token per second due to the memory bandwidth bottleneck of CPU-RAM communication.
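The bandwidth bottleneck can be quantified with a rough upper bound: each generated token must stream the active expert weights out of system RAM, so RAM bandwidth divided by bytes-read-per-token caps throughput. The bandwidth and bits-per-weight figures below are illustrative assumptions, not measurements of any specific machine.

```python
# Theoretical ceiling on CPU-offload generation speed (assumed figures).
ACTIVE_PARAMS = 32e9          # active parameters touched per token
BITS_PER_WEIGHT = 4.6         # approx. effective bits at Q4_K_M
DDR5_BANDWIDTH_GBS = 90.0     # ~dual-channel DDR5-5600, theoretical peak

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8   # ~18.4 GB per token
ceiling_tok_s = DDR5_BANDWIDTH_GBS / (bytes_per_token / 1e9)

print(f"theoretical ceiling: {ceiling_tok_s:.1f} tok/s")
# Real-world throughput falls well below this bound (scattered expert
# weights, cache misses, CPU compute), hence the sub-1 tok/s warning above.
```

Even this optimistic bound lands in the single digits, which is why the sub-1 tok/s figure for offloaded inference is plausible in practice.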
Both use MoE architectures, but Kimi K2 0905 has a larger total parameter count (1000B vs DeepSeek's ~671B). Kimi generally offers a larger context window (256K vs 128K), making it superior for long-document tasks. However, DeepSeek-V3 often provides better price-to-performance ratios for those using specialized inference engines like vLLM.
Llama 3.1 405B is a dense model. While it has fewer total parameters than Kimi, it requires more compute per token because every parameter is active. Kimi K2 Instruct 0905 will likely feel "snappier" in terms of time-to-first-token (TTFT) on hardware that can fit the weights, thanks to its 32B active parameter count. Choose Llama for broad general knowledge and Kimi for specific agentic coding and long-context reasoning.