
Moonshot AI's flagship open-source native multimodal agentic model. Built via continual pretraining on ~15T mixed visual and text tokens atop Kimi-K2-Base. Supports text, image, and video input. Features Agent Swarm for parallel multi-agent task execution. Thinking and non-thinking modes. Uses MoonViT vision encoder.
Copy and paste this command to start running the model locally:

```shell
ollama run kimi-k2.5:cloud
```

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 77.9 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 84.6 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 87.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 91.6 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 99.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 130.0 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Rating | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 GPU | S | 76.1 tok/s | 84.6 GB |
| NVIDIA H200 SXM 141GB | S | 45.7 tok/s | 84.6 GB |
| NVIDIA H100 SXM5 80GB | B | 31.9 tok/s | 84.6 GB |
| Google Cloud TPU v5p | A | 26.3 tok/s | 84.6 GB |
| NVIDIA A100 SXM4 80GB | C | 19.4 tok/s | 84.6 GB |
Kimi K2.5 is Moonshot AI’s flagship multimodal Mixture of Experts (MoE) model designed for complex agentic workflows. With a massive 1000B total parameter count, it represents a significant shift toward "native" multimodality, trained on 15 trillion mixed visual and text tokens. Unlike models that tack on vision capabilities as an afterthought, K2.5 integrates the MoonViT vision encoder directly, allowing it to process text, images, and video natively. It competes directly with other ultra-large open-weight models like DeepSeek-V3 and Llama 3.1 405B, though its MoE architecture provides a distinct advantage in inference efficiency.
For developers looking to run Kimi K2.5 locally, the primary draw is its "Thinking Mode"—a reasoning process similar to OpenAI’s o1 series—and its integrated Agent Swarm capability. This allows for parallel multi-agent execution within a single local instance. While the 1000B parameter scale sounds daunting, the MoE architecture only activates 32B parameters during any single forward pass, making it surprisingly responsive if you have the VRAM to house the full weights.
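The source does not document an Agent Swarm API, so the pattern can only be sketched abstractly: fan independent sub-tasks out to agents running in parallel, then gather the results. The sketch below uses `asyncio` with stub agent functions standing in for real model calls; the agent names and task strings are illustrative assumptions, not part of Kimi K2.5's actual interface:

```python
import asyncio

async def run_agent(name: str, task: str) -> str:
    # Stand-in for a real agent call; an actual swarm would dispatch each
    # sub-task to a separate Kimi K2.5 agent within one model instance.
    await asyncio.sleep(0.01)  # simulate model latency
    return f"{name} finished: {task}"

async def agent_swarm(tasks: list[str]) -> list[str]:
    # Fan out: one agent per sub-task, all running concurrently.
    coros = [run_agent(f"agent-{i}", t) for i, t in enumerate(tasks)]
    # Fan in: collect results in task order.
    return await asyncio.gather(*coros)

results = asyncio.run(agent_swarm(["search docs", "write tests", "review diff"]))
for r in results:
    print(r)
```

The fan-out/fan-in structure is the point here: total wall-clock time is bounded by the slowest agent rather than the sum of all agents.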
The Kimi K2.5 architecture is built on a Mixture of Experts (MoE) framework: of the 1000B total parameters, only 32B are active per token during inference. This means that while you need significant VRAM to load the full model, the compute (FLOPs) per generated token is closer to that of a mid-sized dense model, yielding higher tokens-per-second than a dense 1000B model could achieve.
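To make the memory-versus-compute split concrete: VRAM scales with total parameters, while per-token compute scales with active parameters. The ~2 FLOPs per active parameter per generated token used below is a common rule of thumb for transformer decoding, not a figure from the model card:

```python
TOTAL_PARAMS = 1000e9   # 1000B total parameters (all must be resident in memory)
ACTIVE_PARAMS = 32e9    # 32B active per token (drives per-token compute)

# Only a small fraction of the weights participate in any single forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rule-of-thumb decode cost: ~2 FLOPs per active parameter per token.
flops_per_token = 2 * ACTIVE_PARAMS

print(f"active fraction: {active_fraction:.1%}")                  # 3.2%
print(f"compute per token: {flops_per_token / 1e9:.0f} GFLOPs")   # 64 GFLOPs
```

Under these assumptions each token costs roughly what a 32B dense model would, even though the full 1000B footprint must sit in memory.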
The 256k context length is a critical feature for local practitioners. It enables the ingestion of entire codebases or long-form technical documentation without the aggressive needle-in-a-haystack degradation seen in smaller models. Because it was continually pretrained from the Kimi-K2-Base, it retains high stability in instruction-following even at the edges of its context window.
Kimi K2.5 is positioned as a "reasoning" model first. It excels at tasks that require logic and multi-step planning, particularly coding and complex mathematical derivation.
The biggest hurdle for this model is its VRAM requirement. Even with MoE efficiency, all 1000B parameters must fit in memory. For local practitioners, this necessitates heavy quantization or multi-GPU setups.
To run a 1000B-parameter model on consumer GPUs, you must use 4-bit quantization or lower.
Use llama.cpp to run the GGUF versions. Ollama manages memory offloading across multiple GPUs automatically, which is essential for a model of this scale. On a high-end multi-GPU setup, expect 5-15 tokens per second depending on active-expert utilization and your NVLink configuration. In "Thinking Mode," perceived speed will be lower because the model generates internal reasoning tokens before producing the final answer.
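The perceived-speed effect of Thinking Mode is easy to quantify: if reasoning tokens are generated before the answer, the user-visible answer rate is the raw decode rate scaled by the answer's share of total tokens. The 10 tok/s below is drawn from the 5-15 t/s range above; the reasoning/answer token split is an illustrative assumption:

```python
def visible_answer_rate(raw_tps: float, reasoning_tokens: int, answer_tokens: int) -> float:
    """Answer tokens per wall-clock second when hidden reasoning tokens come first."""
    total_tokens = reasoning_tokens + answer_tokens
    elapsed_s = total_tokens / raw_tps          # time to decode everything
    return answer_tokens / elapsed_s            # rate the user actually perceives

# 10 tok/s raw decode; 600 hidden reasoning tokens before a 200-token answer.
print(f"{visible_answer_rate(10.0, 600, 200):.1f} answer tok/s")  # 2.5
```

In this example a 10 tok/s decode rate feels like 2.5 tok/s to the user, which matches the qualitative warning above about Thinking Mode latency.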
When evaluating ultra-large open models for local deployment, Kimi K2.5 is most often compared to DeepSeek-V3 and Llama 3.1 405B.
Kimi K2.5 is the current gold standard for practitioners who need a massive, multimodal "brain" for local deployment and have the hardware budget to support a 1000B parameter footprint. Its MoE design ensures that once the VRAM hurdle is cleared, the actual user experience is remarkably fluid.