
Moonshot AI's flagship open-source native multimodal agentic model. Built via continual pretraining on ~15T mixed visual and text tokens atop Kimi-K2-Base. Supports text, image, and video input. Features Agent Swarm for parallel multi-agent task execution. Thinking and non-thinking modes. Uses MoonViT vision encoder.
Copy and paste this command to start running the model locally:

```shell
ollama run kimi-k2.5:cloud
```

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 77.9 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 84.6 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 87.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 91.6 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 99.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 130.0 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Rating | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 GPU | S | 76.1 tok/s | 84.6 GB |
| NVIDIA H200 SXM 141GB | S | 45.7 tok/s | 84.6 GB |
| NVIDIA H100 SXM5 80GB | B | 31.9 tok/s | 84.6 GB |
| Google Cloud TPU v5p | A | 26.3 tok/s | 84.6 GB |
| NVIDIA A100 SXM4 80GB | C | 19.4 tok/s | 84.6 GB |
Kimi K2.5 is Moonshot AI’s flagship multimodal Mixture of Experts (MoE) model designed for complex agentic workflows. With a massive 1000B total parameter count, it represents a significant shift toward "native" multimodality, trained on 15 trillion mixed visual and text tokens. Unlike models that tack on vision capabilities as an afterthought, K2.5 integrates the MoonViT vision encoder directly, allowing it to process text, images, and video natively. It competes directly with other ultra-large open-weight models like DeepSeek-V3 and Llama 3.1 405B, though its MoE architecture provides a distinct advantage in inference efficiency.
For developers looking to run Kimi K2.5 locally, the primary draw is its "Thinking Mode"—a reasoning process similar to OpenAI’s o1 series—and its integrated Agent Swarm capability. This allows for parallel multi-agent execution within a single local instance. While the 1000B parameter scale sounds daunting, the MoE architecture only activates 32B parameters during any single forward pass, making it surprisingly responsive if you have the VRAM to house the full weights.
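The source does not document an Agent Swarm API, so the pattern can only be sketched abstractly: fan independent sub-tasks out to agents running in parallel, then gather the results. The sketch below uses `asyncio` with stub agent functions standing in for real model calls; the agent names and task strings are illustrative assumptions, not part of Kimi K2.5's actual interface:

```python
import asyncio

async def run_agent(name: str, task: str) -> str:
    # Stand-in for a real agent call; an actual swarm would dispatch each
    # sub-task to a separate Kimi K2.5 agent within one model instance.
    await asyncio.sleep(0.01)  # simulate model latency
    return f"{name} finished: {task}"

async def agent_swarm(tasks: list[str]) -> list[str]:
    # Fan out: one agent per sub-task, all running concurrently.
    coros = [run_agent(f"agent-{i}", t) for i, t in enumerate(tasks)]
    # Fan in: collect results in task order.
    return await asyncio.gather(*coros)

results = asyncio.run(agent_swarm(["search docs", "write tests", "review diff"]))
for r in results:
    print(r)
```

The fan-out/fan-in structure is the point here: total wall-clock time is bounded by the slowest agent rather than the sum of all agents.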
The Kimi K2.5 architecture is built on a Mixture of Experts (MoE) framework: of the 1000B total parameters, only 32B are active per token during inference. This means that while you need significant VRAM to load the full model, the compute (FLOPs) per generated token is closer to that of a mid-sized dense model, yielding higher tokens-per-second than a dense 1000B model could achieve.
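To make the memory-versus-compute split concrete: VRAM scales with total parameters, while per-token compute scales with active parameters. The ~2 FLOPs per active parameter per generated token used below is a common rule of thumb for transformer decoding, not a figure from the model card:

```python
TOTAL_PARAMS = 1000e9   # 1000B total parameters (all must be resident in memory)
ACTIVE_PARAMS = 32e9    # 32B active per token (drives per-token compute)

# Only a small fraction of the weights participate in any single forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rule-of-thumb decode cost: ~2 FLOPs per active parameter per token.
flops_per_token = 2 * ACTIVE_PARAMS

print(f"active fraction: {active_fraction:.1%}")                  # 3.2%
print(f"compute per token: {flops_per_token / 1e9:.0f} GFLOPs")   # 64 GFLOPs
```

Under these assumptions each token costs roughly what a 32B dense model would, even though the full 1000B footprint must sit in memory.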
The 256k context length is a critical feature for local practitioners. It enables the ingestion of entire codebases or long-form technical documentation without the aggressive needle-in-a-haystack degradation seen in smaller models. Because it was continually pretrained from the Kimi-K2-Base, it retains high stability in instruction-following even at the edges of its context window.
Kimi K2.5 is positioned as a "reasoning" model first. It excels at tasks that require logic and multi-step planning, particularly coding and complex mathematical derivation.
The biggest hurdle for this model is its VRAM requirement. Even with MoE efficiency, all 1000B parameters must fit in memory. For local practitioners, this necessitates heavy quantization or multi-GPU setups.
To run a 1000B-parameter model on consumer GPUs, you must use 4-bit quantization or lower.
Use llama.cpp to run the GGUF versions. Ollama manages memory offloading across multiple GPUs automatically, which is essential for a model of this scale. On a high-end multi-GPU setup, expect 5-15 tokens per second depending on active-expert utilization and your NVLink configuration. In "Thinking Mode," perceived speed will be lower because the model generates internal reasoning tokens before producing the final answer.
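The perceived-speed effect of Thinking Mode is easy to quantify: if reasoning tokens are generated before the answer, the user-visible answer rate is the raw decode rate scaled by the answer's share of total tokens. The 10 tok/s below is drawn from the 5-15 t/s range above; the reasoning/answer token split is an illustrative assumption:

```python
def visible_answer_rate(raw_tps: float, reasoning_tokens: int, answer_tokens: int) -> float:
    """Answer tokens per wall-clock second when hidden reasoning tokens come first."""
    total_tokens = reasoning_tokens + answer_tokens
    elapsed_s = total_tokens / raw_tps          # time to decode everything
    return answer_tokens / elapsed_s            # rate the user actually perceives

# 10 tok/s raw decode; 600 hidden reasoning tokens before a 200-token answer.
print(f"{visible_answer_rate(10.0, 600, 200):.1f} answer tok/s")  # 2.5
```

In this example a 10 tok/s decode rate feels like 2.5 tok/s to the user, which matches the qualitative warning above about Thinking Mode latency.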
When evaluating ultra-large open models for local deployment, Kimi K2.5 is most often compared to DeepSeek-V3 and Llama 3.1 405B.
Kimi K2.5 is the current gold standard for practitioners who need a massive, multimodal "brain" for local deployment and have the hardware budget to support a 1000B parameter footprint. Its MoE design ensures that once the VRAM hurdle is cleared, the actual user experience is remarkably fluid.