Moonshot AI

Kimi K2.7 Code

Moonshot AI's coding-focused successor to Kimi K2.6, a 1T-parameter Mixture-of-Experts model that activates 32B parameters per token across 384 experts (8 active plus 1 shared). It is multimodal with text, image, and video input, supports a 256K-token context, and always runs in thinking mode while using roughly 30% fewer reasoning tokens than K2.6. On Moonshot's own benchmarks it scores 62.0 on Kimi Code Bench v2, 53.6 on Program Bench, 35.1 on MLS Bench Lite, and 81.1 on MCP Mark Verified. Weights ship open under a Modified MIT License.

1000B params · MoE · 262K ctx · Multimodal

Our Take

Best for: Strongest at graduate-level reasoning (GPQA) in its size class

A workable 1000B-parameter MoE language model from Moonshot AI. It pulls ahead on graduate-level reasoning, scoring about 90/100 on GPQA, so reach for it when that is the dimension that matters. It is newly released, so production-readiness is still being shaken out.

Run this on: Acer Veriton GN100 AI Mini. Cheapest card in our directory with comfortable headroom (128 GB) for this model at Q4 (~86.2 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Code Generation

Reasoning

Function Calling

Vision

Chat

Multilingual

Instruction Following

Model Specifications

Parameters: 1000B

Active Params: 32B

Architecture: MoE

Context Length: 262K tokens

Modality: Multimodal

Provider: Moonshot AI

Download Size: 595.2 GB

Community

Monthly Downloads: 502.1K

Likes: 987

Last Updated: 11 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

Modified MIT License

Performance & Scoring

Benchmarks

GPQA

89.6

HLE

32.8

AA Intelligence Index

41.9

47.5

44.7

63.1

66.3

MBA Open Score

53.6 (C)

Component	Weight	Score
Benchmark	40%	55.1
Popularity	25%	48.5
Efficiency	20%	29.6
Versatility	15%	90.0
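The four component scores and their percentages appear to combine into the headline number as a plain weighted average. The sketch below checks that reading; the page does not state the formula, so the linear weighting and the helper name `weighted_score` are assumptions.

```python
# ASSUMPTION: the headline score is a linear weighted average of the
# four components; the page lists weights and scores but no formula.
COMPONENTS = {
    "Benchmark": (0.40, 55.1),
    "Popularity": (0.25, 48.5),
    "Efficiency": (0.20, 29.6),
    "Versatility": (0.15, 90.0),
}

def weighted_score(components):
    """Weighted average of (weight, score) pairs, normalised by total weight."""
    total_weight = sum(weight for weight, _ in components.values())
    weighted_sum = sum(weight * score for weight, score in components.values())
    return weighted_sum / total_weight

score = weighted_score(COMPONENTS)
print(f"{score:.3f}")  # about 53.585, which matches the displayed 53.6
```

Under that assumption the low Efficiency score (29.6) costs more than the high Versatility score (90.0) gains, because Efficiency carries the larger weight.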

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	79.4 GB	Low	Aggressive quantization: smallest size, noticeable quality loss
Q4_K_M (recommended)	86.2 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	89.4 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	93.2 GB	Excellent	Near-lossless quality with manageable size
Q8_0	101.2 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	131.6 GB	Full	Full 16-bit floating point: maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


Device	Vendor	Tier	Speed	VRAM Required
NVIDIA H200 SXM 141GB	NVIDIA	S	44.8 tok/s	86.2 GB
AMD Instinct MI300X	AMD	S	49.5 tok/s	86.2 GB
Google TPU v7 (Ironwood)	Google	S	68.9 tok/s	86.2 GB
NVIDIA B200 GPU	NVIDIA	S	74.7 tok/s	86.2 GB
AMD Instinct MI325X	AMD	S	56.1 tok/s	86.2 GB
AMD Instinct MI355X	AMD	S	74.7 tok/s	86.2 GB
Intel Gaudi 3 AI Accelerator	Intel	S	34.6 tok/s	86.2 GB
ASUS ExpertCenter Pro ET900N G3	ASUS	S	66.3 tok/s	86.2 GB
Dell Pro Max with GB300	Dell	S	66.3 tok/s	86.2 GB
HP ZGX Fury AI Station	HP	S	66.3 tok/s	86.2 GB
MSI XpertStation WS300	MSI	S	66.3 tok/s	86.2 GB
SuperMicro Super AI Station	SuperMicro	S	66.3 tok/s	86.2 GB
Gigabyte W775-V10-L01	Gigabyte	S	66.3 tok/s	86.2 GB
Intel Gaudi 2 AI Accelerator	Intel	A	22.9 tok/s	86.2 GB
Google Cloud TPU v5p	Google	A	25.8 tok/s	86.2 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	B	7.5 tok/s	86.2 GB
Apple Mac Studio (M2 Ultra, 2023)	Apple	B	7.5 tok/s	86.2 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	B	5.7 tok/s	86.2 GB
MacBook Pro 16-inch M5 Max (2026)	Apple	B	5.7 tok/s	86.2 GB
MacBook Pro 16" M5 Max (2026)	Apple	B	5.7 tok/s	86.2 GB
Apple M4 Max (40-core GPU)	Apple	B	5.1 tok/s	86.2 GB
Apple Mac Studio (M4 Max, 2025)	Apple	B	5.1 tok/s	86.2 GB
MacBook Pro 14-inch M4 Max (2024)	Apple	B	5.1 tok/s	86.2 GB
MacBook Pro 16" M4 Max (2024)	Apple	B	5.1 tok/s	86.2 GB
Acer Veriton GN100 AI Mini	Acer	B	2.6 tok/s	86.2 GB

Run Locally vs API

Energy cost on NVIDIA A100 SXM4 80GB (~19 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	Kimi K2.7 Code on NVIDIA A100 SXM4 80GB · ~19 tok/s · 400W	$0.70
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.
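The per-million-token figures in the table can be reproduced with a few lines of arithmetic. The sketch below applies the stated 70% input / 30% output blend to the listed API prices and estimates the local energy cost from throughput and power draw. The electricity price of $0.12 per kWh is an assumption inferred from the table (it reproduces the listed local figure); the page does not state it. Function names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")

def blended_price(input_usd, output_usd, input_share=Decimal("0.70")):
    """Blend per-1M-token input and output prices at the given input share."""
    blended = input_share * input_usd + (1 - input_share) * output_usd
    return blended.quantize(CENT, rounding=ROUND_HALF_UP)

def energy_cost_per_million(tokens_per_s, watts, usd_per_kwh):
    """Electricity cost (USD) to generate 1M tokens at a steady rate and power draw."""
    hours = 1_000_000 / tokens_per_s / 3600
    return hours * (watts / 1000) * usd_per_kwh

# (input, output) USD per 1M tokens, copied from the table.
API_PRICES = {
    "GPT-5.5": (Decimal("5.00"), Decimal("30.00")),
    "Claude Opus 4.7 Thinking": (Decimal("5.00"), Decimal("25.00")),
    "Gemini 3.5 Flash": (Decimal("1.50"), Decimal("9.00")),
    "Grok 4.3": (Decimal("1.25"), Decimal("2.50")),
}

blended = {name: blended_price(i, o) for name, (i, o) in API_PRICES.items()}
for name, price in blended.items():
    print(f"{name}: ${price}")

# ASSUMPTION: $0.12/kWh is inferred from the table, not stated on the page.
local = energy_cost_per_million(tokens_per_s=19, watts=400, usd_per_kwh=0.12)
print(f"Local (energy only): ${local:.2f}")
```

`Decimal` with half-up rounding is used for the API prices because the Grok blend lands exactly on $1.625, which binary floating point would format as $1.62 rather than the listed $1.63.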

Rent in the Cloud

Cheapest current cloud rentals with at least 86 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
AMD Instinct MI300X	RunPod	Community	192 GB	$0.50
NVIDIA H200 NVL	RunPod	Community	141 GB	$0.50
NVIDIA H200 NVL	Vast.ai	Spot	141 GB	$1.67
NVIDIA H200 SXM	Vast.ai	Spot	141 GB	$1.93
NVIDIA H200 NVL	Vast.ai	On-Demand	141 GB	$1.94

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

About This Model

Overview

Moonshot AI’s Kimi K2.7 Code is a 1-trillion-parameter Mixture-of-Experts model built specifically for agentic coding and software engineering. It activates only 32 billion of its 1,000 billion parameters per token (about 3.2%), which makes it a sparse, compute-efficient architecture for long-horizon tasks such as end-to-end code generation, refactoring, and autonomous debugging.

As the successor to K2.6, K2.7 Code reports substantial relative gains on Moonshot’s in-house benchmarks: +21.8% on Kimi Code Bench v2 (62.0 vs. 50.9), +11.0% on Program Bench, and +31.5% on MLS Bench Lite. It also uses roughly 30% fewer reasoning tokens, so it reaches answers with less output. The model handles text, image, and video input, supports a 262,144-token context, and always operates in thinking mode.
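The percentages quoted for the K2.6 comparison are relative improvements, not point differences. A small sketch using the only pair of raw scores given (62.0 vs. 50.9 on Kimi Code Bench v2) shows the distinction; the helper names are illustrative.

```python
def point_gain(new, old):
    """Absolute difference in benchmark points."""
    return new - old

def relative_gain_pct(new, old):
    """Relative improvement over the old score, in percent."""
    return (new - old) / old * 100.0

k27, k26 = 62.0, 50.9  # Kimi Code Bench v2 scores quoted in the text
print(f"+{point_gain(k27, k26):.1f} points, +{relative_gain_pct(k27, k26):.1f}% relative")
# prints "+11.1 points, +21.8% relative"
```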

K2.7 Code is released under a Modified MIT License — open weights that you can inspect, fine-tune, and deploy. It’s positioned as a direct competitor to closed coding agents like GPT-5.5 and Claude Opus 4.8, but with the transparency and control that local-first practitioners demand.

Architecture & Technical Details

K2.7 Code uses a sparse MoE layout with 384 experts, of which 8 are selected per token plus 1 shared expert. The total parameter count is 1T, but only 32B are active at any inference step. This means:

Memory is the bottleneck: You need to load all 1T parameters into VRAM. At FP16 that’s ~2 TB; even at 4-bit quantization expect ~500 GB. This requires multi-GPU setups such as 8× A100 80 GB (640 GB total); 4× H100 80 GB (320 GB total) would not hold even the 4-bit weights.
Compute is light: With only 32B active parameters per forward pass, throughput can be high once the model is loaded. Tokens-per-second will be dominated by memory bandwidth and expert routing overhead.
Efficient attention: Uses Multi-head Latent Attention (MLA), which compresses the key-value cache to reduce memory overhead during generation, together with SwiGLU activations in the feed-forward layers.
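To make the "8 of 384 experts plus 1 shared" layout concrete, here is a minimal sketch of generic top-k expert routing for a single token. It is not Moonshot's router: the softmax-over-selected-logits gating, the random logits, and the function name `route_token` are all assumptions chosen for illustration.

```python
import math
import random

NUM_EXPERTS = 384  # expert count, from the spec table
TOP_K = 8          # experts selected per token

def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, gate_weight)] for the top-k experts of one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    # ASSUMPTION: softmax over the selected logits only, so the gates sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    norm = sum(exps)
    return [(i, e / norm) for i, e in zip(top, exps)]

rng = random.Random(0)  # stand-in for the router's learned projection
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

selected = route_token(logits)
# The shared expert runs for every token in addition to the selected ones.
experts_run = len(selected) + 1
print(f"{experts_run} expert networks run for this token")
```

Only the selected experts' weights take part in the forward pass, which is why compute scales with the 32B active parameters while memory still scales with the full parameter count.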

The model includes a 400M-parameter MoonViT vision encoder, enabling multimodal understanding for image and video inputs. The context window is 262,144 tokens — enough for large code repositories, entire documentation sets, or long conversation histories.

Key Specs

Parameter	Value
Total parameters	1T (1000B)
Active parameters per token	32B
Number of experts	384 (8 active + 1 shared)
Context length	262,144 tokens
Attention mechanism	MLA
Activation	SwiGLU
Vision encoder	MoonViT (400M)
Vocabulary	160K tokens
License	Modified MIT

Capabilities & Use Cases

K2.7 Code is specialized for agentic coding workflows — tasks that require multi-step reasoning, tool use, and interaction with the filesystem and shell. It excels at:

End-to-end software engineering: Write, debug, and refactor entire codebases. The model can spawn subagents for parallel work, fetch web documentation, and execute shell commands.
Long-horizon tasks: Automate large refactors, migrate frameworks, or patch security vulnerabilities across many files.
Function calling and MCP: Works with the Model Context Protocol (MCP) for tool integration. Scores 81.1 on MCP Mark Verified, indicating reliable agent orchestration.
Code reasoning: Always operates in thinking mode, which helps with complex logic like algorithm implementation, system design, and test generation.
Multilingual and multimodal: Supports chat in multiple languages and can analyze images (diagrams, UI mockups) and video (walkthroughs, demos).

Because it’s open weight, you can fine-tune it on your own codebase or domain-specific rules. The Kimi Code CLI (available via curl install) provides a ready-to-use terminal interface that wraps K2.7 Code with file I/O, search, and subagent spawning capabilities.

Running Kimi K2.7 Code Locally

Running a 1T-parameter model locally demands serious hardware. Here’s what you need to know:

VRAM Requirements

Full precision (FP16): ~2 TB VRAM. This is beyond a single 8× H100 80 GB node (640 GB total); it takes roughly 25 or more 80 GB GPUs spread across several nodes.
4-bit quantization (Q4_K_M): ~500 GB. That means at least seven 80 GB data-center GPUs (in practice an 8-GPU node), or about 21 consumer cards at 24 GB each, before counting KV cache and activations.
2-bit quantization (Q2): ~250 GB. 4× RTX 4090 (24 GB each, 96 GB total) holds well under half of that, so most of the weights would have to be offloaded to system RAM at a large speed cost; quality tradeoffs are unknown.
Active parameters only: The sparse activation (32B) helps compute but not memory; the full 1T must be resident.
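The VRAM figures above follow from parameter count times bits per weight. A minimal sketch, assuming nominal bit widths and counting weights only (no KV cache, activations, or framework overhead); real quantized formats store extra bits for scales, so actual files tend to be somewhat larger. The function name is illustrative.

```python
TOTAL_PARAMS = 1_000_000_000_000  # 1T, from the spec table

def weight_memory_gb(total_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return total_params * bits_per_weight / 8 / 1e9

# ASSUMPTION: nominal bit widths; quantization metadata is ignored.
estimates = {
    label: weight_memory_gb(TOTAL_PARAMS, bits)
    for label, bits in [("FP16", 16), ("4-bit", 4), ("2-bit", 2)]
}
for label, gb in estimates.items():
    print(f"{label}: ~{gb:,.0f} GB")
```

The sparse activation does not enter this calculation at all: every expert has to be stored, whether or not a given token routes to it.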

Realistic Hardware Configurations

Configuration	Total VRAM	Feasibility	Notes
Single RTX 4090 (24 GB) or M4 Max (64–128 GB unified)	24–128 GB	❌ No	Not enough for Q4 or higher
4× RTX 4090	96 GB	⚠️ Marginal	Possible only with Q2 and heavy CPU offloading
1× A100 80 GB	80 GB	❌ No	Q2 still needs > 200 GB, even with offloading
8× A100 80 GB	640 GB	✅ Yes	Fits Q4_K_M
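A quick way to sanity-check a configuration is to compare pooled VRAM against the approximate weight sizes quoted in this section (~500 GB at 4-bit, ~250 GB at 2-bit). The sketch below does only that; it ignores KV cache, activation memory, and interconnect limits, and the helper names are hypothetical.

```python
import math

Q4_GB = 500  # approximate 4-bit weight size quoted in this section
Q2_GB = 250  # approximate 2-bit weight size quoted in this section

def fits(required_gb, gpu_count, per_gpu_gb):
    """True if pooled VRAM covers the weights (no headroom for KV cache)."""
    return gpu_count * per_gpu_gb >= required_gb

def gpus_needed(required_gb, per_gpu_gb):
    """Minimum number of identical GPUs whose pooled VRAM holds the weights."""
    return math.ceil(required_gb / per_gpu_gb)

print(fits(Q4_GB, 8, 80))      # 8x A100 80 GB at Q4 -> True
print(fits(Q2_GB, 4, 24))      # 4x RTX 4090 at Q2 -> False without offloading
print(gpus_needed(Q4_GB, 80))  # 7
print(gpus_needed(Q4_GB, 24))  # 21
```

Any configuration that only just passes this check will still need room for the KV cache, which grows with context length.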

Recommended Approach for Most Practitioners

Unless you have a multi-GPU workstation or cloud GPU cluster, consider:

Use the Kimi Code CLI (cloud-hosted) — it provides the same agentic interface without hardware requirements.
Rent a multi-GPU instance from a provider that supports custom Docker images with vLLM or TensorRT-LLM.
Watch for smaller distilled versions — no official ones yet, but the community may release quantized or adapted models.

Performance Expectations

With a proper multi-GPU setup using 8× H100, throughput on the order of 50–100 tokens per second for text generation is a reasonable expectation, though it varies with batch size, context length, and quantization. The 30% reduction in reasoning tokens means fewer output tokens per task compared to K2.6.

How It Compares

K2.7 Code competes directly with other open-weight coding models, but its scale is unique. Here’s how it stacks up:

Model	Total / Active Parameters	Context	Reported Coding Benchmark
Kimi K2.7 Code	1T / 32B	256K	Kimi Code Bench v2: 62.0
DeepSeek-Coder-V2	236B / 21B	128K	SWE-bench Verified: ~20–25% (older)
Qwen2.5-Coder-32B	32B (dense)	128K	Aider-polyglot: ~35–40%
GPT-5.5 (closed)	–	–	Kimi Code Bench v2: 69.0
Claude Opus 4.8 (closed)	–	–	Kimi Code Bench v2: 67.4

Choose K2.7 Code if: You need a powerful open coding agent for complex, multi-file, long-context tasks and have access to multi-GPU infrastructure. Its sparse 32B active parameters keep inference fast once the model is loaded.

Choose a smaller model (e.g., Qwen2.5-Coder-32B) if: You’re limited to a single consumer GPU. You’ll lose some agentic capability but gain deployability.

Benchmark caveat: All K2.7 Code scores are from Moonshot’s own benchmarks. Independent evaluations on SWE-bench or Aider are not yet available, so treat the numbers as indicative rather than definitive.
