
Moonshot AI's 1 trillion parameter MoE model with 32B active parameters. Trained on 15.5T tokens using the Muon optimizer. Optimized for agentic capabilities including tool use and autonomous problem-solving. Achieves 65.8% on SWE-bench Verified.
Copy and paste this command to start running the model locally:

```shell
ollama run kimi-k2:1t-cloud
```

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 45.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 51.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 55.0 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 58.9 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 66.9 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 97.3 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| Google Cloud TPU v5p | Google | SS | 42.9 tok/s | 51.8 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 52.0 tok/s | 51.8 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 74.6 tok/s | 51.8 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 124.3 tok/s | 51.8 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 31.7 tok/s | 51.8 GB |
Kimi K2 Instruct is a massive-scale Mixture of Experts (MoE) model developed by Moonshot AI. With a total parameter count of 1 trillion (1000B), it represents a significant push in the open-weights landscape toward frontier-level performance. Despite its total size, it utilizes a sparse architecture where only 32 billion parameters are active during any single forward pass. This design allows Kimi K2 to offer the reasoning depth of a trillion-parameter model while maintaining the inference latency of a much smaller dense model.
Trained on a massive 15.5 trillion token dataset using the innovative Muon optimizer, Kimi K2 Instruct is specifically tuned for agentic workflows. It competes directly with other high-parameter MoE models like DeepSeek-V3 and Grok-1. For developers looking to run Kimi K2 Instruct locally, the model offers a compelling balance: state-of-the-art performance on coding and reasoning benchmarks with an efficiency profile that makes it accessible on high-end workstation hardware.
The defining characteristic of Kimi K2 Instruct is its MoE efficiency. By activating only 32B of its 1000B parameters per token, the model significantly reduces the compute requirements for inference compared to a dense 1000B model. However, the primary bottleneck for local deployment is not compute, but memory. Even with sparse activation, the entire 1000B parameter set must reside in VRAM (or system RAM) to avoid massive latency penalties during expert switching.
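The compute-vs-memory distinction above can be made concrete with a back-of-envelope sketch. The ~2 FLOPs per active parameter per token rule of thumb is a standard estimate for decoder inference, not a published figure for this architecture:

```python
# Illustrative sketch: why MoE compute scales with ACTIVE parameters
# while memory scales with TOTAL parameters.
TOTAL_PARAMS_B = 1000   # all experts must stay resident in VRAM/RAM
ACTIVE_PARAMS_B = 32    # experts actually used per forward pass

# Rough rule of thumb: ~2 FLOPs per active parameter per generated token.
flops_per_token = 2 * ACTIVE_PARAMS_B * 1e9
dense_equivalent = 2 * TOTAL_PARAMS_B * 1e9

print(f"MoE compute per token:  {flops_per_token / 1e9:.0f} GFLOPs")
print(f"Dense 1000B per token:  {dense_equivalent / 1e9:.0f} GFLOPs")
print(f"Compute reduction:      {dense_equivalent / flops_per_token:.1f}x")
```

The roughly 31x compute reduction is why the model can decode at the latency of a ~32B dense model, while the memory footprint remains that of the full trillion parameters.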
The 128k context length is robust enough for large-scale document analysis and complex repository-level coding tasks. Because it was trained with the Muon optimizer, the model exhibits better convergence and stability in instruction-following compared to many first-generation MoE models.
Kimi K2 Instruct is positioned as an "agentic" model, meaning it excels at multi-step problem solving and tool interaction. Its reasoning benchmark scores are particularly high, notably achieving 65.8% on SWE-bench Verified, placing it among the top-performing models for autonomous software engineering.
The primary challenge with this model is its hardware footprint. With 1000B parameters, the VRAM requirement is substantial even before accounting for the KV cache.
To calculate the VRAM needed, you must look at the quantization level. A 1000B model in 16-bit (FP16) would require 2TB of VRAM, which is impossible on consumer hardware. Quantization is mandatory for local use.
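The arithmetic behind these figures can be sketched as a small estimator. The ~10% overhead factor for KV cache and activation buffers is an assumption, and the 4.5 bits/weight value is a rough effective size for Q4_K_M-style quantization:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float,
                     overhead: float = 1.1) -> float:
    """Rough VRAM estimate: weights at the given bit-width, plus ~10%
    overhead for KV cache and activation buffers (assumed, not measured)."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# 1000B parameters at FP16 (16 bits) vs a ~4.5-bit Q4-class quantization:
print(estimate_vram_gb(1000, 16))   # ~2 TB class
print(estimate_vram_gb(1000, 4.5))  # ~600 GB class
```

This matches the ballpark figures quoted elsewhere in this article: roughly 2TB at FP16 and roughly 600GB at 4-bit for the full trillion-parameter weight set.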
Running a local AI model with 1000B parameters in 2025 requires either a distributed multi-node setup or a high-memory multi-GPU workstation.
For most practitioners, Q4_K_M is the gold standard for maintaining intelligence. However, given the 1000B scale, IQ4_XS or even Q3_K_L are often more practical. The "intelligence collapse" typically seen in smaller models at 3-bit is less pronounced in 1000B models, meaning a 3-bit Kimi K2 will still outperform a 4-bit 70B model.
The Kimi K2 Instruct tokens per second (t/s) will vary based on your interconnect (NVLink vs. PCIe Gen4/5). On a well-optimized 8x GPU setup, expect 5-12 t/s.
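To put that throughput range in practical terms, here is a trivial wall-clock estimate for a long response (the 1000-token response length is just an example):

```python
def generation_seconds(tokens: int, toks_per_s: float) -> float:
    """Wall-clock time to decode `tokens` at a steady rate, ignoring
    prompt-processing (prefill) time."""
    return tokens / toks_per_s

# At the 5-12 t/s range quoted for a well-optimized 8x GPU setup:
print(f"{generation_seconds(1000, 5):.0f}s")   # 200s at 5 t/s
print(f"{generation_seconds(1000, 12):.0f}s")  # 83s at 12 t/s
```

In other words, a long agentic response can take one to three minutes, so interconnect bandwidth is worth optimizing before adding more GPUs.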
Ollama remains the most accessible entry point for local testing. Once you have the necessary memory pool, you can run:

```shell
ollama run kimi-k2
```

(Note: ensure your backend is configured for multi-GPU memory pooling.)
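One way to configure that pooling is via the Ollama server's environment variables. `OLLAMA_SCHED_SPREAD` asks the scheduler to spread a model across all visible GPUs rather than packing it onto as few as possible; verify the variable against your installed Ollama version, as behavior varies across releases:

```shell
# Expose all eight GPUs and ask Ollama to spread the model across them
# (OLLAMA_SCHED_SPREAD is an Ollama server env var; check your version's docs).
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export OLLAMA_SCHED_SPREAD=1
ollama serve
```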
When evaluating Kimi K2 Instruct, the most relevant comparisons are DeepSeek-V3, another large MoE model, and dense frontier models such as Llama 3.1 405B.
| Feature | Kimi K2 Instruct | DeepSeek-V3 | Llama 3.1 405B |
| :--- | :--- | :--- | :--- |
| Architecture | MoE (1000B) | MoE (671B) | Dense (405B) |
| Active Params | 32B | 37B | 405B |
| Context | 128k | 128k | 128k |
| Best For | Agents & Reasoning | Coding & Logic | General Purpose |
| Min VRAM (Q4) | ~600GB | ~400GB | ~230GB |
For users with massive VRAM availability, Kimi K2 Instruct is the superior choice for autonomous agent tasks. If VRAM is limited to 256GB-384GB, DeepSeek-V3 or a heavily quantized Llama 405B are more realistic targets.