
Reasoning variant of Kimi K2, trained to interleave chain-of-thought reasoning with function calls. Sets SOTA on Humanity's Last Exam and BrowseComp. Native INT4 quantization via QAT for 2x speedup. Maintains coherent tool use across 200-300 consecutive invocations.
A workable 1,000B-parameter MoE language model from Moonshot AI. It pulls ahead on graduate-level reasoning (GPQA: 85/100), so reach for it when that's the dimension that matters.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Copy and paste this command to start running the model locally.
`ollama run kimi-k2-thinking:cloud`

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 77.9 GB | Low |
| Q4_K_M (Recommended) | 84.6 GB | Good |
| Q5_K_M | 87.8 GB | Very Good |
| Q6_K | 91.6 GB | Excellent |
| Q8_0 | 99.6 GB | Near Perfect |
| FP16 | 130.0 GB | Full |
See which devices can run this model and at what quality level.
| Device | Tier | Speed | VRAM used (Q4_K_M) |
|---|---|---|---|
| NVIDIA H200 SXM 141GB | S | 45.7 tok/s | 84.6 GB |
| Google TPU v7 (Ironwood) | S | 70.2 tok/s | 84.6 GB |
| NVIDIA B200 GPU | S | 76.1 tok/s | 84.6 GB |
| SuperMicro Super AI Station | S | 67.6 tok/s | 84.6 GB |
| Gigabyte W775-V10-L01 | S | 67.6 tok/s | 84.6 GB |
| Google Cloud TPU v5p | A | 26.3 tok/s | 84.6 GB |
Energy cost on NVIDIA A100 SXM4 80GB (~19 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only) · Kimi K2 Thinking on NVIDIA A100 SXM4 80GB · ~19 tok/s · 400W | $0.687 |
| GPT-5.5 (OpenAI) · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic) · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.1 Flash Lite Preview (Google) · in $0.250 · out $1.50 | $0.625 |
| Grok 4.3 beta (xAI) · in $3.00 · out $15.00 | $6.60 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
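The arithmetic behind both columns is simple enough to reproduce. Below is a minimal sketch of the two calculations; the ~$0.12/kWh electricity rate is an assumption back-solved from the $0.687 figure above, so substitute your own tariff and measured throughput.

```python
# Rough cost-per-million-tokens math behind the table above.
# Assumptions: 400 W draw, ~19 tok/s sustained, ~$0.12/kWh electricity
# (the rate implied by the $0.687 figure); adjust for your own tariff.

WATTS = 400
TOKENS_PER_SECOND = 19
PRICE_PER_KWH = 0.1175  # USD, assumed

def local_energy_cost_per_million_tokens() -> float:
    hours_per_million = 1_000_000 / TOKENS_PER_SECOND / 3600
    kwh = (WATTS / 1000) * hours_per_million
    return kwh * PRICE_PER_KWH

def blended_api_cost(price_in: float, price_out: float,
                     input_share: float = 0.7) -> float:
    """Blend per-1M-token input/output prices at a fixed traffic mix."""
    return input_share * price_in + (1 - input_share) * price_out

print(f"Local energy: ${local_energy_cost_per_million_tokens():.3f} per 1M tokens")
print(f"GPT-5.5 blended: ${blended_api_cost(5.00, 30.00):.2f} per 1M tokens")
```

At a 70/30 input/output mix this reproduces the $12.50 GPT-5.5 figure and the $0.687 local-energy figure shown above.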
Kimi K2 Thinking is Moonshot AI’s flagship reasoning model, designed to compete directly with the highest tier of frontier LLMs. Unlike standard chat models, K2 Thinking utilizes an advanced "Chain-of-Thought" (CoT) process, interleaving internal reasoning steps with external function calls. This architecture allows it to verify its own logic and execute complex tool-use sequences before delivering a final answer. At 1,000B parameters, it is one of the largest MoE (Mixture of Experts) models available for local deployment, specifically optimized for high-stakes reasoning, mathematics, and long-context code generation.
While the total parameter count is massive, the model uses a sparse architecture in which only 32B parameters are active during any single inference pass. This gives Kimi K2 Thinking unusually high MoE efficiency for its weight class: you get the knowledge breadth of a trillion-parameter model with inference latency closer to that of a medium-sized dense model. It currently sets state-of-the-art (SOTA) marks on Humanity's Last Exam and the BrowseComp benchmark, outperforming many proprietary models in complex instruction following and multi-step problem solving.
Kimi K2 Thinking is built on a massive MoE framework totaling 1,000B parameters. The 32B active parameter count is the critical metric when estimating tokens per second. Because only a fraction of the weights are triggered per token, the compute requirement is significantly lower than that of a 1,000B dense model, though the VRAM footprint remains dictated by the total parameter count unless aggressive quantization is used.
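To see why only a fraction of the weights fire per token, here is a minimal, illustrative top-k expert-routing sketch in PyTorch. This is not Moonshot's implementation; the expert count, dimensions, and k value are arbitrary placeholders chosen only to show the mechanism.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative top-k MoE routing: each token is processed by only k of
    the n experts, so per-token compute scales with k, not with n."""

    def __init__(self, d_model: int = 64, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate_logits = self.router(x)
        weights, idx = gate_logits.topk(self.k, dim=-1)   # pick k experts per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                   # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = TinyMoELayer()
print(layer(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```

The same principle scales up: a trillion-parameter MoE stores every expert in memory, but each token only pays the compute cost of the few experts the router selects.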
Key technical specifications include:
- Total parameters: 1,000B (Mixture of Experts)
- Active parameters per token: 32B
- Context window: 256K tokens
- Native INT4 quantization via Quantization-Aware Training (QAT)
- Training objective that interleaves chain-of-thought reasoning with function calls
The 256k context window is particularly robust. Unlike models that suffer from "lost in the middle" syndrome, K2 Thinking maintains high retrieval accuracy across the entire buffer. The model was trained specifically to interleave reasoning with tool use, meaning it doesn't just call a function—it reasons about why it is calling it, evaluates the output, and adjusts its strategy in real-time.
Coding is one of the primary use cases for Kimi K2 Thinking. It excels at refactoring large codebases and debugging complex logic errors that require understanding dependencies across multiple files. Because it can handle 200-300 consecutive function calls without losing coherence, it is an ideal engine for autonomous coding agents and local CI/CD analysis tools.
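A rough sketch of how an agent loop can drive that interleaved tool use through an OpenAI-compatible endpoint (Ollama exposes one at `/v1` by default). The `run_tests` tool, its schema, and the stub implementation are hypothetical placeholders for illustration, not part of the model or of Ollama.

```python
import json
from openai import OpenAI  # pip install openai; works with any OpenAI-compatible server

# Assumes a local OpenAI-compatible endpoint; adjust base_url and model tag as needed.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

# Hypothetical tool the agent may call repeatedly while working on a codebase.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's test suite and return the failures.",
        "parameters": {"type": "object",
                       "properties": {"path": {"type": "string"}},
                       "required": ["path"]},
    },
}]

def run_tests(path: str) -> str:            # stub implementation for the sketch
    return json.dumps({"failed": 0, "path": path})

messages = [{"role": "user", "content": "Fix the failing tests in ./src"}]
while True:
    resp = client.chat.completions.create(model="kimi-k2-thinking",
                                          messages=messages, tools=TOOLS)
    msg = resp.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:                   # model produced its final answer
        print(msg.content)
        break
    for call in msg.tool_calls:              # execute each requested tool call
        args = json.loads(call.function.arguments)
        result = run_tests(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```

The loop simply keeps feeding tool results back until the model stops requesting calls, which is the pattern long 200-300 call chains rely on.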
Other high-performance use cases include:
- Graduate-level reasoning and mathematics (GPQA, Humanity's Last Exam)
- Agentic web research and browsing workflows (BrowseComp)
- Long-context analysis of large documents and codebases (256K window)
- Multi-step problem solving with extended tool-use chains
The model's native INT4 quantization via QAT provides a 2x speedup over standard FP16 inference without the typical accuracy degradation seen in post-training quantization (PTQ). This makes it a prime candidate for users looking to run Kimi K2 Thinking locally on high-end workstation hardware.
The primary challenge for this model is the Kimi K2 Thinking VRAM requirements. Even with its MoE efficiency, the 1,000B total parameter count necessitates significant memory overhead. You cannot run the full-weight model on a single consumer GPU.
To determine the best GPU for Kimi K2 Thinking, you must first decide on your quantization target.
On a high-bandwidth 8x H100 or A100 cluster, you can expect 20-30 tokens per second. On consumer-grade multi-GPU setups (e.g., 4x 4090), expect closer to 2-5 tokens per second due to PCIe bottlenecking and the sheer volume of weights being shifted during MoE routing.
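If you want to check what your own stack actually delivers, a quick way to measure decode throughput is to read the eval counters returned by Ollama's non-streaming generate endpoint. A minimal sketch, assuming Ollama is serving the model on the default port; swap in whatever model tag your installation uses.

```python
import requests

# Measure decode throughput from Ollama's reported eval counters.
resp = requests.post("http://localhost:11434/api/generate",
                     json={"model": "kimi-k2-thinking:cloud",
                           "prompt": "Explain MoE routing in two sentences.",
                           "stream": False},
                     timeout=600).json()

tokens = resp["eval_count"]            # generated tokens
seconds = resp["eval_duration"] / 1e9  # reported in nanoseconds
print(f"{tokens / seconds:.1f} tok/s decode throughput")
```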
When evaluating Kimi K2 Thinking vs DeepSeek-V3 or Llama 3.1 405B, the distinction lies in the reasoning architecture: K2 Thinking is trained to interleave chain-of-thought reasoning with tool calls, whereas DeepSeek-V3 (also a sparse MoE) and the dense Llama 3.1 405B answer through standard instruction following without that native reasoning-and-tool loop.
Choose Kimi K2 Thinking if your priority is state-of-the-art local reasoning at the trillion-parameter scale and you have the VRAM capacity to support a trillion-parameter MoE. If you are limited to 128GB of VRAM or less, you will likely find better performance with smaller, dense models or more aggressively pruned MoEs.
