
A 1.0T-parameter native multimodal agentic model using a Mixture of Experts (MoE) architecture to enable 300-agent swarm orchestration, long-horizon codebase overhauls, and 24/7 proactive execution.
Copy and paste this command to start running the model locally: `ollama run kimi-k2.6`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 79.4 GB | Low |
| Q4_K_M (Recommended) | 86.2 GB | Good |
| Q5_K_M | 89.4 GB | Very Good |
| Q6_K | 93.2 GB | Excellent |
| Q8_0 | 101.2 GB | Near Perfect |
| FP16 | 131.6 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| NVIDIA H200 SXM 141GB | NVIDIA | S | 44.8 tok/s | 86.2 GB |
| NVIDIA B200 GPU | NVIDIA | S | 74.7 tok/s | 86.2 GB |
| SuperMicro Super AI Station | SuperMicro | S | 66.3 tok/s | 86.2 GB |
| Gigabyte W775-V10-L01 | Gigabyte | S | 66.3 tok/s | 86.2 GB |
| Google Cloud TPU v5p | Google | A | 25.8 tok/s | 86.2 GB |
Kimi K2.6, developed by Moonshot AI, is a native multimodal Mixture of Experts (MoE) model designed for high-autonomy agentic workflows. With a total parameter count of 1000B (1.0T) and 32B active parameters per token, it occupies the extreme high-end of the open-weight landscape. Unlike models optimized purely for chat, K2.6 is specifically engineered for "long-horizon" tasks—complex, multi-step operations that require sustained reasoning over several hours of execution.
While many trillion-parameter models are cumbersome for local deployment, the MoE architecture of K2.6 makes it a viable candidate for high-end local workstations. It competes directly with other massive-scale open models like DeepSeek-V3 or Llama 3.1 405B, but distinguishes itself through its "Agent Swarm" orchestration. This capability allows the model to coordinate up to 300 sub-agents on parallelized tasks, such as a codebase-wide refactor or generating a full-stack application from a single vision-based prompt.
Kimi K2.6 utilizes a sophisticated Mixture of Experts (MoE) framework. While the model contains 1000B total parameters, only 32B are activated during any single forward pass. This sparsity is the key to running Kimi K2.6 locally; it provides the reasoning depth of a trillion-parameter model with the inference latency more typical of a mid-sized dense model.
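As a rough sketch of why this matters, the arithmetic below compares per-token compute for a hypothetical dense 1T model against K2.6's 32B-active forward pass, using the common ~2 FLOPs-per-parameter-per-token rule of thumb (an approximation, not a published figure for this model):

```python
# Why MoE sparsity matters for speed: per-token compute scales with the
# *active* parameters, while a dense model touches every weight.
# Rule of thumb (approximate): ~2 FLOPs per parameter per generated token.
TOTAL_PARAMS = 1.0e12    # 1.0T total -- what must be stored
ACTIVE_PARAMS = 32e9     # 32B routed per forward pass -- what must be computed

dense_flops = 2 * TOTAL_PARAMS   # hypothetical dense 1T model
moe_flops = 2 * ACTIVE_PARAMS    # K2.6's sparse forward pass

print(f"Dense 1T model:   {dense_flops / 1e12:.2f} TFLOPs per token")
print(f"MoE (32B active): {moe_flops / 1e12:.3f} TFLOPs per token")
print(f"Compute ratio:    ~{dense_flops / moe_flops:.0f}x fewer FLOPs per token")
```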
The model features a 262,144 (256K) token context window, which is essential for its primary use case: long-horizon coding. This allows practitioners to feed entire repositories or massive documentation sets into the prompt without losing coherence. Its native multimodality is handled via the MoonViT encoder, enabling the model to "see" UI layouts, diagrams, and video files directly rather than relying on a separate vision-to-text bridge.
K2.6 is not a general-purpose chatbot; it is a tool for autonomous execution. Moonshot AI has optimized the model for "proactive execution," meaning it is designed to use tools and call functions with minimal human steering over runs lasting up to 12 hours.
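A minimal sketch of what driving this tool use looks like from a local client, assuming the model has been pulled via Ollama and the `ollama` Python package is installed; the `run_tests` tool schema is a hypothetical example, not part of the model or Ollama itself:

```python
# Minimal function-calling loop against a local Ollama server.
# Assumes `ollama pull kimi-k2.6` has completed and the `ollama`
# Python client is installed (pip install ollama).
import ollama

# Hypothetical tool schema for illustration -- not part of the model itself.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the project's unit test suite and return failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

response = ollama.chat(
    model="kimi-k2.6",
    messages=[{"role": "user",
               "content": "Refactor utils.py, then verify nothing broke."}],
    tools=tools,
)

# The model decides when to invoke the tool; inspect any requested calls.
for call in response["message"].get("tool_calls", []) or []:
    print(call["function"]["name"], call["function"]["arguments"])
```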
K2.6 excels at end-to-end codebase overhauls. It can navigate complex directory structures in languages like Rust, Go, and Python to implement features or optimize performance across multiple files. For local developers, this means the model can act as a localized "junior engineer" that handles repetitive refactoring or unit test generation across a large local context.
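A sketch of the client-side plumbing for such a run, assuming a local Python project and the `ollama` client; the project path, file selection, and refactor instruction are all illustrative:

```python
# Sketch: feed an entire local package into the 256K context and ask for
# repo-wide changes. File selection and prompt wording are illustrative.
from pathlib import Path

import ollama

repo = Path("./my_project")  # hypothetical project root
corpus = "\n\n".join(
    f"# FILE: {p.relative_to(repo)}\n{p.read_text(encoding='utf-8')}"
    for p in sorted(repo.rglob("*.py"))
)

prompt = (
    "Refactor this codebase to add type hints and unit tests for every "
    "public function, keeping behavior identical. Reply with unified diffs.\n\n"
    + corpus
)

reply = ollama.chat(model="kimi-k2.6",
                    messages=[{"role": "user", "content": prompt}])
print(reply["message"]["content"])
```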
Because the model is natively multimodal, it can take a screenshot of a legacy UI or a hand-drawn wireframe and generate production-ready code. It supports generating structured layouts, interactive CSS animations, and full-stack logic, effectively bridging the gap between design and implementation in a single step.
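For example, a screenshot-to-code request can be issued through the Ollama client's image support; the file name and prompt below are illustrative:

```python
# Sketch: screenshot-to-code via the Ollama client's image support.
import ollama

reply = ollama.chat(
    model="kimi-k2.6",
    messages=[{
        "role": "user",
        "content": "Recreate this UI as a responsive HTML/CSS page.",
        "images": ["legacy_ui.png"],  # path to a local screenshot (hypothetical)
    }],
)
print(reply["message"]["content"])
```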
The "Agent Swarm" feature allows K2.6 to decompose a massive objective—such as "build a data visualization dashboard for this CSV"—into 300 sub-tasks. It then orchestrates these agents to handle data cleaning, backend logic, and frontend styling simultaneously. This makes it particularly effective for local researchers who need to process large-scale datasets through complex, multi-step pipelines.
Running a 1000B parameter model locally is a significant hardware challenge, even with MoE efficiency. The primary bottleneck is VRAM: while the active parameters (32B) dictate inference speed, the total parameters (1000B) must still be resident in memory, though quantization significantly lowers the barrier.
To run Kimi K2.6 locally, you must account for the weights and the KV cache for that 256K context window.
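The KV cache can be estimated with the standard 2 × layers × KV heads × head dim × bytes × tokens formula. The dimensions below are illustrative placeholders, not Kimi K2.6's published configuration; substitute real values from the model's config files:

```python
# Rough KV-cache sizing for long contexts. Dimensions are illustrative
# placeholders, NOT Kimi K2.6's published configuration.
CTX = 262_144       # full 256K-token window
N_LAYERS = 61       # assumed
N_KV_HEADS = 8      # assumed (GQA-style caches shrink this dramatically)
HEAD_DIM = 128      # assumed
BYTES = 2           # fp16 cache entries; a q8_0 cache would halve this

kv_gb = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES * CTX / 1e9  # K and V
print(f"KV cache at full 256K context ≈ {kv_gb:.1f} GB")
```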
The quickest way to deploy is via Ollama, using the `ollama run kimi-k2.6` command. For practitioners looking for maximum performance, using llama.cpp with GGUF quantizations (or ExLlamaV2 with EXL2) allows for more granular control over layer offloading.
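When the model does not fully fit in VRAM, llama.cpp's `-ngl` flag controls how many layers are offloaded to the GPU. The sketch below estimates a starting value from the Q4_K_M footprint in the table above; the layer count, overhead figure, and GGUF filename are assumptions, not measured or published values:

```python
# Sketch: estimate a starting -ngl value for llama.cpp partial offload.
# Every number here is an assumption, not a measured value for K2.6.
VRAM_BUDGET_GB = 80.0   # e.g., a single 80 GB accelerator
MODEL_SIZE_GB = 86.2    # Q4_K_M footprint from the table above
N_LAYERS = 61           # assumed layer count
OVERHEAD_GB = 6.0       # assumed GPU-side KV cache + scratch buffers

per_layer_gb = MODEL_SIZE_GB / N_LAYERS
ngl = int((VRAM_BUDGET_GB - OVERHEAD_GB) / per_layer_gb)
# Filename below is hypothetical; point -m at your actual GGUF file.
print(f"Try: llama-cli -m kimi-k2.6-Q4_K_M.gguf -ngl {min(ngl, N_LAYERS)}")
```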
Kimi K2.6 sits in a specialized niche between general-purpose LLMs and specialized coding assistants.
Llama 3.1 405B is a dense model, meaning every parameter is used for every token. While Llama may have higher "raw" knowledge density, K2.6 is significantly faster during inference due to its MoE architecture. For coding and agentic tasks, K2.6's 256K context window and native vision give it a distinct advantage over Llama's text-only design.
Both models utilize MoE architectures and are strong in coding and math. However, K2.6 is specifically tuned for "proactive" agentic behavior—the ability to run for 12 hours and 4,000 steps without human intervention. While DeepSeek-V3 often wins on pure code benchmarks (like HumanEval), K2.6 is generally superior for "Agent Swarm" tasks where multi-step orchestration is required.
The jump from 2.5 to 2.6 focuses almost entirely on the "Agentic" pillar. While K2.5 introduced multimodal capabilities, K2.6 optimizes the logic for tool use and long-duration autonomous runs. If your workflow involves simple chat or single-file code edits, K2.5 remains a lighter, more efficient choice. For full-stack generation and swarm-based tasks, K2.6 is the necessary upgrade.