Largest open-weight MoE language model with 1.6T total / 49B active parameters and native 1M-token context. Uses a hybrid attention architecture (Compressed Sparse Attention + Heavily Compressed Attention) that requires only 27% of FLOPs and 10% of KV cache versus DeepSeek-V3.2 at 1M context. Pre-trained on 32T+ tokens with Muon optimizer. Open-source SOTA on agentic coding (LiveCodeBench 93.5, Codeforces 3206) and competitive with top closed-source models on reasoning (GPQA Diamond 90.1, SWE-bench Verified 80.6).
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 410.6 GB | Low |
| Q4_K_M (Recommended) | 420.9 GB | Good |
| Q5_K_M | 425.8 GB | Very Good |
| Q6_K | 431.7 GB | Excellent |
| Q8_0 | 443.9 GB | Near Perfect |
| FP16 | 490.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Grade | Speed | Memory Required |
|---|---|---|---|
| Gigabyte W775-V10-L01 | A | 13.6 tok/s | 420.9 GB |
| SuperMicro Super AI Station | A | 13.6 tok/s | 420.9 GB |
| Apple M4 | F | 0.2 tok/s | 420.9 GB |
| Apple M5 | F | 0.3 tok/s | 420.9 GB |
DeepSeek-V4-Pro is the largest open-weight language model available, a 1.6 trillion parameter Mixture-of-Experts (MoE) architecture from DeepSeek that activates only 49 billion parameters per token. Released under the MIT license in April 2026, it represents a direct challenge to closed-source frontier models from OpenAI, Anthropic, and Google—matching or exceeding their performance on coding and reasoning benchmarks while remaining fully downloadable and runnable on your own hardware.
This is not a cloud-only model. DeepSeek-V4-Pro is designed for local deployment, with architectural innovations that make its 1 million token context window feasible on hardware that exists today. It competes directly with models like GPT-5 and Claude Opus 4.6 on agentic coding tasks, where it scores 93.5 on LiveCodeBench and 80.6 on SWE-bench Verified—the highest coding benchmark scores of any model at launch.
What sets DeepSeek-V4-Pro apart is its hybrid attention mechanism, which slashes compute requirements to 27% of the FLOPs and 10% of the KV cache needed by its predecessor DeepSeek-V3.2 at full context length. This is not theoretical efficiency—it translates directly to lower VRAM requirements and faster inference when running locally.
DeepSeek-V4-Pro uses a Mixture-of-Experts architecture with 1600 billion total parameters distributed across experts, but only 49 billion are activated for any single token. This is the key to running a model of this scale: the 49B active parameter count is what determines VRAM usage and inference speed, not the 1.6T total. The remaining parameters are loaded into system memory and swapped in as needed by the router.
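To make the "only a fraction of parameters per token" idea concrete, here is a minimal sketch of top-k expert routing. The expert count, top-k value, and layer dimensions are illustrative placeholders, not DeepSeek-V4-Pro's published configuration.

```python
import numpy as np

# Minimal top-k MoE routing sketch (illustrative sizes, not the real config).
rng = np.random.default_rng(0)

d_model, n_experts, top_k = 1024, 64, 4
router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,   # up-projection
     rng.standard_normal((4 * d_model, d_model)) * 0.02)   # down-projection
    for _ in range(n_experts)
]

def moe_layer(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w                            # score every expert
    chosen = np.argsort(logits)[-top_k:]             # keep only the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                         # softmax over the chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        up, down = experts[idx]
        out += w * (np.maximum(x @ up, 0.0) @ down)  # only k expert FFNs actually run
    return out

token = rng.standard_normal(d_model)
y = moe_layer(token)
print(f"experts evaluated per token: {top_k} of {n_experts}")
```

Only the experts selected for a given token do any work, which is why the active-parameter count, not the total, is what drives inference memory and speed.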
The architecture introduces three major innovations:
Hybrid Attention (CSA + HCA): DeepSeek-V4-Pro combines Compressed Sparse Attention with Heavily Compressed Attention to handle the 1 million token context efficiently. Standard attention mechanisms scale quadratically with sequence length, making long contexts prohibitively expensive. CSA maintains fine-grained attention for local context while HCA provides compressed representations for distant tokens. The result is 27% of the FLOPs and 10% of the KV cache versus DeepSeek-V3.2 at 1M context length.
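The actual CSA/HCA kernels are not reproduced here; the sketch below only illustrates the general pattern the paragraph describes, under assumed sizes: exact attention over a recent local window, plus attention over pooled summaries of distant tokens, which is what shrinks the effective KV cache.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_attention(q, keys, values, local_window=128, block=32):
    """Illustrative hybrid attention for a single query vector.

    Distant tokens are mean-pooled into one key/value per `block` tokens (coarse),
    while the most recent `local_window` tokens keep exact keys/values (fine-grained).
    """
    n, d = keys.shape
    split = max(n - local_window, 0)

    # Compress the distant prefix into one pooled key/value per block.
    far_k = [keys[i:i + block].mean(axis=0) for i in range(0, split, block)]
    far_v = [values[i:i + block].mean(axis=0) for i in range(0, split, block)]

    # Keep the local window exact.
    k_eff = np.array(far_k + list(keys[split:]))
    v_eff = np.array(far_v + list(values[split:]))

    attn = softmax(k_eff @ q / np.sqrt(d))
    print(f"kv entries attended: {len(k_eff)} instead of {n}")
    return attn @ v_eff

n_ctx, d = 4096, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n_ctx, d))
V = rng.standard_normal((n_ctx, d))
out = hybrid_attention(q, K, V)
```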
Manifold-Constrained Hyper-Connections (mHC): This replaces standard residual connections with a mechanism that improves signal propagation across 300+ layers while maintaining model expressivity. In practice, this means more stable training at scale and better gradient flow during fine-tuning.
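The details of mHC are not spelled out in this overview, so the snippet below is only a generic hyper-connection-style residual block (several parallel hidden streams mixed by learnable weights) to give a feel for what "replacing standard residual connections" means; the manifold constraint itself is not modeled, and none of the shapes come from the released architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_streams = 512, 4            # assumed sizes; n_streams > 1 widens the residual path

# Mixing weights (learnable in a real model, just initialized here).
read_w  = np.full(n_streams, 1.0 / n_streams)   # how streams combine into the layer input
write_w = np.full(n_streams, 1.0 / n_streams)   # how the layer output is written back
mix_w   = np.eye(n_streams)                     # stream-to-stream mixing

def layer_fn(x):
    """Stand-in for an attention or FFN sublayer."""
    return np.tanh(x)

def hyper_connection_block(streams):
    """One block: read from all streams, apply the sublayer, mix and write back."""
    h_in = read_w @ streams                      # weighted read across streams -> (d_model,)
    h_out = layer_fn(h_in)
    streams = mix_w @ streams                    # reshuffle information between streams
    return streams + np.outer(write_w, h_out)    # distribute the update across streams

streams = np.tile(rng.standard_normal(d_model), (n_streams, 1))
for _ in range(8):                               # stack a few blocks
    streams = hyper_connection_block(streams)
out = streams.mean(axis=0)                       # collapse back to a single representation
```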
Muon Optimizer: DeepSeek-V4-Pro was pre-trained on 32 trillion tokens using the Muon optimizer, which accelerates convergence compared to AdamW. This is a departure from the standard optimizer used by most open models and contributes to the model's strong performance on reasoning tasks.
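Beyond the optimizer's name, the document does not describe DeepSeek's training recipe, so the sketch below shows only the core Muon update as published by its authors: momentum accumulation followed by a Newton-Schulz orthogonalization of the update matrix. Learning rate, momentum, and matrix shapes are placeholder values.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix via the Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)           # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2D weight matrix."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    weight -= lr * update
    return weight, momentum

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)) * 0.02
m = np.zeros_like(w)
grad = rng.standard_normal(w.shape) * 0.01       # stand-in gradient
w, m = muon_step(w, m, grad)
```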
The model supports a native 1 million token context window without any sliding window or truncation tricks. For local deployment, this means you can feed entire codebases, lengthy technical documentation, or multi-hour conversation transcripts as a single input.
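One practical way to exploit that window locally is to concatenate a project's source files into a single prompt and sanity-check the rough token count before sending it. The file filters and the 4-characters-per-token heuristic below are rough assumptions, not part of any DeepSeek tooling.

```python
from pathlib import Path

# Rough sketch: pack a repository into one long prompt for a 1M-token context window.
EXTENSIONS = {".py", ".ts", ".go", ".rs", ".md"}      # adjust to your project
CHARS_PER_TOKEN = 4                                   # crude heuristic, not a real tokenizer

def pack_repo(root: str, budget_tokens: int = 900_000) -> str:
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in EXTENSIONS or not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        est_tokens = len(text) // CHARS_PER_TOKEN
        if used + est_tokens > budget_tokens:          # leave headroom for the answer
            break
        parts.append(f"### FILE: {path}\n{text}")
        used += est_tokens
    print(f"packed ~{used:,} tokens from {len(parts)} files")
    return "\n\n".join(parts)

prompt = pack_repo(".") + "\n\nExplain the overall architecture of this codebase."
```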
DeepSeek-V4-Pro is a text-only model with broad capabilities across chat, code generation, mathematical reasoning, function calling, multilingual text, creative writing, and instruction following. Its strengths cluster around three areas:
Agentic Coding: This is where DeepSeek-V4-Pro excels. Scores of 93.5 on LiveCodeBench and 3206 on Codeforces place it ahead of every other model on competitive programming and real-world software engineering tasks. It handles multi-file code generation, test creation, debugging, and code review. For practitioners running local development workflows, this means the model can function as a code assistant that understands your entire project context within its 1M token window.
Reasoning & Math: GPQA Diamond score of 90.1 places it among the top reasoning models. It handles chain-of-thought reasoning, mathematical proofs, and complex logical deductions. For local deployment, this makes it suitable for research assistance, data analysis, and technical problem-solving.
Instruction Following: DeepSeek-V4-Pro supports function calling natively, making it suitable for agentic workflows where the model needs to call tools, APIs, or execute code. The multilingual capabilities cover major languages including Chinese, English, Japanese, Korean, and European languages.
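As a sketch of what function calling looks like when the model is served behind a local OpenAI-compatible endpoint (both Ollama and llama.cpp's server expose one), the example below defines a single tool and inspects any tool calls the model emits. The endpoint URL, model tag, and the run_tests tool are placeholders for illustration.

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server, e.g. Ollama's /v1 endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",                      # hypothetical tool for illustration
        "description": "Run the project's test suite and return the failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Test directory"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-pro",                      # placeholder tag; use whatever your server reports
    messages=[{"role": "user", "content": "The auth tests are failing; investigate."}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as JSON to execute locally.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```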
Concrete use cases include: running a local code assistant that indexes your entire repository, building autonomous coding agents that can fix bugs and write tests without cloud dependency, processing lengthy technical documents for question-answering, and running math-heavy analytical tasks offline.
DeepSeek-V4-Pro is a large model, but the MoE architecture makes it more accessible than a 1600B dense model would be. The active 49B parameters are what determine inference memory, and quantization brings this down further.
VRAM Requirements: See the quantization table above. The recommended Q4_K_M build needs roughly 421 GB of combined GPU and system memory, rising to about 490 GB at FP16.
Hardware Options: Multi-GPU workstations and high-memory AI servers (such as the Gigabyte and SuperMicro systems in the device table above) reach roughly 13-14 tok/s, while unified-memory machines like Apple silicon can hold the weights but run well below 1 tok/s.
Getting Started: Ollama provides the quickest path to running DeepSeek-V4-Pro locally. Pull the model with `ollama pull deepseek-v4-pro`; Ollama handles quantization and hardware detection automatically. For more control, use llama.cpp directly with custom quantization settings and GPU offloading.
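Once the model is pulled, a minimal way to query it from Python is Ollama's local HTTP API; the model tag below assumes the name used in the pull command above.

```python
import requests

# Ollama listens on localhost:11434 by default; /api/chat is its native chat endpoint.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-v4-pro",               # tag from `ollama pull` above
        "messages": [
            {"role": "user", "content": "Summarize what a Mixture-of-Experts router does."}
        ],
        "stream": False,                          # return one complete response
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```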
Context Length Considerations: The 1M context window requires significant memory. At 1M tokens with Q4_K_M quantization, expect approximately 40-50 GB for the KV cache alone. Most users will run at 128K-256K context for practical local use, which keeps VRAM within single-GPU limits.
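A back-of-envelope estimate shows where figures like this come from. The layer count, KV-head count, and head dimension below are placeholder assumptions (the released config will have the real values), and the compression factor is the roughly 10x cache reduction quoted for the hybrid attention above.

```python
def kv_cache_gib(n_tokens, n_layers=80, n_kv_heads=12, head_dim=128,
                 bytes_per_elem=2, compression=0.10):
    """Rough KV-cache size: 2 (K and V) * layers * heads * head_dim bytes per token.

    The architecture numbers are placeholders, not the published config;
    `compression` applies the ~10x cache reduction of the hybrid attention.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens * compression / 2**30

for ctx in (128_000, 256_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> ~{kv_cache_gib(ctx):.1f} GiB KV cache")
```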
vs. DeepSeek-V3.2: DeepSeek-V4-Pro is a direct upgrade. The 49B active parameters (versus 37B in V3.2) and hybrid attention architecture deliver substantially better reasoning performance while using less compute at long contexts. If you're already running V3.2 locally, V4-Pro is worth the upgrade for the coding and reasoning improvements alone.
vs. Qwen3-235B-A22B: Qwen3-235B is the closest open-weight competitor at 235B total/22B active parameters. DeepSeek-V4-Pro outperforms it on coding benchmarks (93.5 vs 89.2 LiveCodeBench) and reasoning (90.1 vs 86.5 GPQA Diamond). However, Qwen3-235B requires less VRAM (12-16 GB at Q4) and runs faster on consumer hardware. Choose Qwen3-235B if you're limited to a single 24 GB GPU and need higher throughput. Choose DeepSeek-V4-Pro if you have 32 GB+ VRAM and want the best possible coding and reasoning performance from an open-weight model.
vs. GPT-5 (closed-source): DeepSeek-V4-Pro matches GPT-5 on agentic coding tasks and approaches it on general reasoning. The tradeoff is that GPT-5 runs on OpenAI's infrastructure with lower latency, while DeepSeek-V4-Pro runs locally with no API costs, no data leaving your machine, and no rate limits. For teams that need to process sensitive codebases or want predictable inference costs at scale, DeepSeek-V4-Pro is the better choice.