
Updated version of Kimi K2 released September 2025. Significant improvements in agentic coding intelligence and frontend coding. Context window increased from 128K to 256K tokens. Same 1T MoE architecture with 32B active parameters.
Copy and paste this command to start running the model locally.
`ollama run kimi-k2:1t-cloud`

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 77.9 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 84.6 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 87.8 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 91.6 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 99.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 130.0 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Grade | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 GPU | SS | 76.1 tok/s | 84.6 GB |
| NVIDIA H200 SXM 141GB | SS | 45.7 tok/s | 84.6 GB |
| NVIDIA H100 SXM5 80GB | BB | 31.9 tok/s | 84.6 GB |
| Google Cloud TPU v5p | AA | 26.3 tok/s | 84.6 GB |
| NVIDIA A100 SXM4 80GB | CC | 19.4 tok/s | 84.6 GB |
The Kimi K2 Instruct 0905 is a massive-scale Mixture of Experts (MoE) model developed by Moonshot AI. Released in September 2025 as a significant update to the original K2, this iteration pushes the model's total parameter count to 1000B (1 Trillion). Despite its enormous footprint, the model utilizes a sparse architecture that only activates 32B parameters during inference, positioning it as a direct competitor to other ultra-large-scale MoE models like DeepSeek-V3 or Grok-1.
For developers looking to run Kimi K2 Instruct 0905 locally, the appeal lies in its "agentic" intelligence. Moonshot AI has specifically tuned this version for high-end reasoning, complex function-calling, and advanced frontend coding. With a doubled context window of 256,000 tokens, it is built for long-form document analysis and multi-file codebase manipulation that smaller models struggle to maintain.
The Kimi K2 Instruct 0905 architecture is a 1000B-parameter Mixture of Experts (MoE) design. In a dense model, every parameter participates in computing every generated token; in this MoE setup, only a small subset of "experts" is triggered for any given input.
MoE efficiency is the model's defining technical trait for local practitioners. Because only 32B parameters are active per token, the compute requirement (FLOPs) is far lower than for a dense 1000B model, which yields much higher tokens-per-second than one might expect from a "1T model." Memory, however, remains the primary bottleneck: all 1000B parameters must reside in VRAM or system RAM to avoid the massive latency of swapping weights from disk.
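The routing idea can be sketched with a toy example. This is not Moonshot AI's actual implementation; the expert count, top-k value, and dimensions below are small, hypothetical stand-ins chosen only to show why per-token compute scales with active parameters rather than total parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny config (K2's real expert count and dims are far larger)
n_experts, top_k, d_model = 8, 2, 16
tokens = rng.standard_normal((4, d_model))            # a batch of 4 tokens
router_w = rng.standard_normal((d_model, n_experts))  # router projection
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x):
    logits = x @ router_w                    # router scores each expert
    picked = np.argsort(logits)[:, -top_k:]  # keep only the top-k per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # softmax over the *selected* experts' logits only
        sel = logits[t, picked[t]]
        w = np.exp(sel - sel.max()); w /= w.sum()
        for weight, e in zip(w, picked[t]):
            out[t] += weight * (x[t] @ experts[e])   # only chosen experts run
    return out, picked

out, picked = moe_forward(tokens)
# Only top_k / n_experts of the expert weights are exercised per token,
# which is why a sparse "1T" model needs far fewer FLOPs than a dense one.
print(f"active fraction per token: {top_k / n_experts:.2%}")
```

Note that while FLOPs scale with the active fraction, every expert must still be resident in memory, since different tokens pick different experts.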
This model is engineered for complex, multi-step tasks rather than simple chat. Its "Instruct" tuning focuses on high-fidelity adherence to system prompts and structured outputs.
Kimi K2 Instruct 0905 for coding excels specifically in frontend frameworks and agentic workflows. It can reason through UI/UX logic, generate complex React or Vue components, and debug across multiple files. Because of the 256K context, you can feed it entire library documentations or large portions of a repository to ensure code consistency.
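In practice, feeding repository files into the 256K window might look like the sketch below. It assumes a local Ollama server is already running the model from the command above; the `/api/generate` endpoint and `num_ctx` option are standard Ollama API features, but verify against your version's documentation. The helper names and the sample file are illustrative only, and nothing is sent over the network until `ask()` is called.

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

def build_payload(prompt: str, context_files: dict) -> dict:
    # Concatenate whole files into the prompt -- the 256K window is the
    # point: smaller-context models would have to truncate this.
    blob = "\n\n".join(f"// {path}\n{src}"
                       for path, src in context_files.items())
    return {
        "model": "kimi-k2:1t-cloud",
        "prompt": f"{blob}\n\n{prompt}",
        "stream": False,
        "options": {"num_ctx": 262144},  # request the full 256K context
    }

def ask(prompt: str, context_files: dict) -> str:
    # Sends the request to the local server (requires Ollama running).
    data = json.dumps(build_payload(prompt, context_files)).encode()
    req = request.Request(OLLAMA_URL, data=data,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

payload = build_payload("Refactor Button to TypeScript.",
                        {"src/Button.jsx": "export const Button = () => null;"})
print(payload["options"]["num_ctx"])
```

For real repositories you would walk the source tree and filter to relevant files, since even 256K tokens fills up quickly with a large codebase.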
The updated 0905 version shows a marked improvement on reasoning benchmarks, particularly in chain-of-thought tasks and mathematical proofs.
The primary challenge of running a 1000B-parameter model locally is the VRAM footprint: even though the active parameter count is low, the total weights must all be stored.
To run this model, you must account for the total 1000B parameters. At 4-bit quantization (Q4_K_M), the model requires approximately 550GB to 600GB of VRAM/RAM.
For most practitioners, the best quantization is Q4_K_M. If VRAM is still the limiting factor, you can drop to IQ2_M or Q3_K_L, though this will noticeably degrade reasoning capability.
| Quantization | Est. VRAM Required | Recommended Hardware |
| :--- | :--- | :--- |
| Q2_K (2-bit) | ~320 GB | Mac Studio (Full Spec) / 4x A6000 |
| Q4_K_M (4-bit) | ~580 GB | 8x A100 80GB Node |
| Q8_0 (8-bit) | ~1.1 TB | Multi-node Cluster |
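The figures in the table above can be sanity-checked with simple arithmetic: weight storage is roughly total parameters times bits per weight, divided by eight. The effective bits-per-weight values below are approximations (K-quants mix precisions across tensors), so treat the results as estimates, not exact requirements.

```python
# Back-of-envelope check: weights-only memory = params * bits / 8.
TOTAL_PARAMS = 1_000e9  # 1000B total -- all experts must be resident

EFFECTIVE_BITS = {      # rough effective bits/weight for common GGUF formats
    "Q2_K": 2.6,
    "Q4_K_M": 4.6,
    "Q8_0": 8.5,        # 8-bit values plus a per-block scale
    "FP16": 16.0,
}

for fmt, bits in EFFECTIVE_BITS.items():
    gb = TOTAL_PARAMS * bits / 8 / 1e9
    print(f"{fmt:>7}: ~{gb:,.0f} GB for weights alone (plus KV cache)")
```

These estimates land close to the table's ~320 GB (Q2_K), ~580 GB (Q4_K_M), and ~1.1 TB (Q8_0); the long-context KV cache adds further overhead on top.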
If you do not have a server farm, the only way to run Kimi K2 Instruct 0905 locally is through GGUF offloading via Ollama or llama.cpp, utilizing system RAM (DDR4/DDR5). Be warned: while this allows the model to load, the generation speed will likely be sub-1 token per second due to the memory bandwidth bottleneck of CPU-RAM communication.
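The bandwidth bottleneck can be quantified with a rough upper bound: each generated token must stream the active expert weights out of system RAM, so RAM bandwidth divided by bytes-read-per-token caps throughput. The bandwidth and bits-per-weight figures below are illustrative assumptions, not measurements of any specific machine.

```python
# Theoretical ceiling on CPU-offload generation speed (assumed figures).
ACTIVE_PARAMS = 32e9          # active parameters touched per token
BITS_PER_WEIGHT = 4.6         # approx. effective bits at Q4_K_M
DDR5_BANDWIDTH_GBS = 90.0     # ~dual-channel DDR5-5600, theoretical peak

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8   # ~18.4 GB per token
ceiling_tok_s = DDR5_BANDWIDTH_GBS / (bytes_per_token / 1e9)

print(f"theoretical ceiling: {ceiling_tok_s:.1f} tok/s")
# Real-world throughput falls well below this bound (scattered expert
# weights, cache misses, CPU compute), hence the sub-1 tok/s warning above.
```

Even this optimistic bound lands in the single digits, which is why the sub-1 tok/s figure for offloaded inference is plausible in practice.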
Both use MoE architectures, but Kimi K2 0905 has a larger total parameter count (1000B vs DeepSeek's ~671B). Kimi generally offers a larger context window (256K vs 128K), making it superior for long-document tasks. However, DeepSeek-V3 often provides better price-to-performance ratios for those using specialized inference engines like vLLM.
Llama 3.1 405B is a dense model. While it has fewer total parameters than Kimi, it requires more compute per token because every parameter is active. Kimi K2 Instruct 0905 will likely feel "snappier" in terms of time-to-first-token (TTFT) on hardware that can fit the weights, thanks to its 32B active parameter count. Choose Llama for broad general knowledge and Kimi for specific agentic coding and long-context reasoning.