Largest open-weight MoE language model with 1.6T total / 49B active parameters and native 1M-token context. Uses a hybrid attention architecture (Compressed Sparse Attention + Heavily Compressed Attention) that requires only 27% of FLOPs and 10% of KV cache versus DeepSeek-V3.2 at 1M context. Pre-trained on 32T+ tokens with Muon optimizer. Open-source SOTA on agentic coding (LiveCodeBench 93.5, Codeforces 3206) and competitive with top closed-source models on reasoning (GPQA Diamond 90.1, SWE-bench Verified 80.6).
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 410.6 GB | Low |
| Q4_K_M (Recommended) | 420.9 GB | Good |
| Q5_K_M | 425.8 GB | Very Good |
| Q6_K | 431.7 GB | Excellent |
| Q8_0 | 443.9 GB | Near Perfect |
| FP16 | 490.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Grade | Speed | Memory Required |
|---|---|---|---|
| Gigabyte W775-V10-L01 | A | 13.6 tok/s | 420.9 GB |
| SuperMicro Super AI Station | A | 13.6 tok/s | 420.9 GB |
| Apple M4 | F | 0.2 tok/s | 420.9 GB |
| Apple M5 | F | 0.3 tok/s | 420.9 GB |
DeepSeek-V4-Pro is the largest open-weight language model available, a 1.6 trillion parameter Mixture-of-Experts (MoE) architecture from DeepSeek that activates only 49 billion parameters per token. Released under the MIT license in April 2026, it represents a direct challenge to closed-source frontier models from OpenAI, Anthropic, and Google—matching or exceeding their performance on coding and reasoning benchmarks while remaining fully downloadable and runnable on your own hardware.
This is not a cloud-only model. DeepSeek-V4-Pro is designed for local deployment, with architectural innovations that make its 1 million token context window feasible on hardware that exists today. It competes directly with models like GPT-5 and Claude Opus 4.6 on agentic coding tasks, where it scores 93.5 on LiveCodeBench and 80.6 on SWE-bench Verified—the highest coding benchmark scores of any model at launch.
What sets DeepSeek-V4-Pro apart is its hybrid attention mechanism, which slashes compute requirements to 27% of the FLOPs and 10% of the KV cache needed by its predecessor DeepSeek-V3.2 at full context length. This is not theoretical efficiency—it translates directly to lower VRAM requirements and faster inference when running locally.
DeepSeek-V4-Pro uses a Mixture-of-Experts architecture with 1600 billion total parameters distributed across experts, but only 49 billion are activated for any single token. This is the key to running a model of this scale: the 49B active parameter count is what determines VRAM usage and inference speed, not the 1.6T total. The remaining parameters are loaded into system memory and swapped in as needed by the router.
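To make the "only a fraction of parameters per token" idea concrete, here is a minimal sketch of top-k expert routing. The expert count, top-k value, and layer dimensions are illustrative placeholders, not DeepSeek-V4-Pro's published configuration.

```python
import numpy as np

# Minimal top-k MoE routing sketch (illustrative sizes, not the real config).
rng = np.random.default_rng(0)

d_model, n_experts, top_k = 1024, 64, 4
router_w = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,   # up-projection
     rng.standard_normal((4 * d_model, d_model)) * 0.02)   # down-projection
    for _ in range(n_experts)
]

def moe_layer(x):
    """Route one token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w                            # score every expert
    chosen = np.argsort(logits)[-top_k:]             # keep only the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                         # softmax over the chosen experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, chosen):
        up, down = experts[idx]
        out += w * (np.maximum(x @ up, 0.0) @ down)  # only k expert FFNs actually run
    return out

token = rng.standard_normal(d_model)
y = moe_layer(token)
print(f"experts evaluated per token: {top_k} of {n_experts}")
```

Only the experts selected for a given token do any work, which is why the active-parameter count, not the total, is what drives inference memory and speed.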
The architecture introduces three major innovations:
Hybrid Attention (CSA + HCA): DeepSeek-V4-Pro combines Compressed Sparse Attention with Heavily Compressed Attention to handle the 1 million token context efficiently. Standard attention mechanisms scale quadratically with sequence length, making long contexts prohibitively expensive. CSA maintains fine-grained attention for local context while HCA provides compressed representations for distant tokens. The result is 27% of the FLOPs and 10% of the KV cache versus DeepSeek-V3.2 at 1M context length.
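The actual CSA/HCA kernels are not reproduced here; the sketch below only illustrates the general pattern the paragraph describes, under assumed sizes: exact attention over a recent local window, plus attention over pooled summaries of distant tokens, which is what shrinks the effective KV cache.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def hybrid_attention(q, keys, values, local_window=128, block=32):
    """Illustrative hybrid attention for a single query vector.

    Distant tokens are mean-pooled into one key/value per `block` tokens (coarse),
    while the most recent `local_window` tokens keep exact keys/values (fine-grained).
    """
    n, d = keys.shape
    split = max(n - local_window, 0)

    # Compress the distant prefix into one pooled key/value per block.
    far_k = [keys[i:i + block].mean(axis=0) for i in range(0, split, block)]
    far_v = [values[i:i + block].mean(axis=0) for i in range(0, split, block)]

    # Keep the local window exact.
    k_eff = np.array(far_k + list(keys[split:]))
    v_eff = np.array(far_v + list(values[split:]))

    attn = softmax(k_eff @ q / np.sqrt(d))
    print(f"kv entries attended: {len(k_eff)} instead of {n}")
    return attn @ v_eff

n_ctx, d = 4096, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n_ctx, d))
V = rng.standard_normal((n_ctx, d))
out = hybrid_attention(q, K, V)
```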
Manifold-Constrained Hyper-Connections (mHC): This replaces standard residual connections with a mechanism that improves signal propagation across 300+ layers while maintaining model expressivity. In practice, this means more stable training at scale and better gradient flow during fine-tuning.
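The details of mHC are not spelled out in this overview, so the snippet below is only a generic hyper-connection-style residual block (several parallel hidden streams mixed by learnable weights) to give a feel for what "replacing standard residual connections" means; the manifold constraint itself is not modeled, and none of the shapes come from the released architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_streams = 512, 4            # assumed sizes; n_streams > 1 widens the residual path

# Mixing weights (learnable in a real model, just initialized here).
read_w  = np.full(n_streams, 1.0 / n_streams)   # how streams combine into the layer input
write_w = np.full(n_streams, 1.0 / n_streams)   # how the layer output is written back
mix_w   = np.eye(n_streams)                     # stream-to-stream mixing

def layer_fn(x):
    """Stand-in for an attention or FFN sublayer."""
    return np.tanh(x)

def hyper_connection_block(streams):
    """One block: read from all streams, apply the sublayer, mix and write back."""
    h_in = read_w @ streams                      # weighted read across streams -> (d_model,)
    h_out = layer_fn(h_in)
    streams = mix_w @ streams                    # reshuffle information between streams
    return streams + np.outer(write_w, h_out)    # distribute the update across streams

streams = np.tile(rng.standard_normal(d_model), (n_streams, 1))
for _ in range(8):                               # stack a few blocks
    streams = hyper_connection_block(streams)
out = streams.mean(axis=0)                       # collapse back to a single representation
```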
Muon Optimizer: DeepSeek-V4-Pro was pre-trained on 32 trillion tokens using the Muon optimizer, which accelerates convergence compared to AdamW. This is a departure from the standard optimizer used by most open models and contributes to the model's strong performance on reasoning tasks.
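Beyond the optimizer's name, the document does not describe DeepSeek's training recipe, so the sketch below shows only the core Muon update as published by its authors: momentum accumulation followed by a Newton-Schulz orthogonalization of the update matrix. Learning rate, momentum, and matrix shapes are placeholder values.

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix via the Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)           # normalize so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2D weight matrix."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    weight -= lr * update
    return weight, momentum

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 128)) * 0.02
m = np.zeros_like(w)
grad = rng.standard_normal(w.shape) * 0.01       # stand-in gradient
w, m = muon_step(w, m, grad)
```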
The model supports a native 1 million token context window without any sliding window or truncation tricks. For local deployment, this means you can feed entire codebases, lengthy technical documentation, or multi-hour conversation transcripts as a single input.
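One practical way to exploit that window locally is to concatenate a project's source files into a single prompt and sanity-check the rough token count before sending it. The file filters and the 4-characters-per-token heuristic below are rough assumptions, not part of any DeepSeek tooling.

```python
from pathlib import Path

# Rough sketch: pack a repository into one long prompt for a 1M-token context window.
EXTENSIONS = {".py", ".ts", ".go", ".rs", ".md"}      # adjust to your project
CHARS_PER_TOKEN = 4                                   # crude heuristic, not a real tokenizer

def pack_repo(root: str, budget_tokens: int = 900_000) -> str:
    parts, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in EXTENSIONS or not path.is_file():
            continue
        text = path.read_text(errors="ignore")
        est_tokens = len(text) // CHARS_PER_TOKEN
        if used + est_tokens > budget_tokens:          # leave headroom for the answer
            break
        parts.append(f"### FILE: {path}\n{text}")
        used += est_tokens
    print(f"packed ~{used:,} tokens from {len(parts)} files")
    return "\n\n".join(parts)

prompt = pack_repo(".") + "\n\nExplain the overall architecture of this codebase."
```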
DeepSeek-V4-Pro is a text-only model with broad capabilities across chat, code generation, mathematical reasoning, function calling, multilingual text, creative writing, and instruction following. Its strengths cluster around three areas:
Agentic Coding: This is where DeepSeek-V4-Pro excels. Scores of 93.5 on LiveCodeBench and 3206 on Codeforces place it ahead of every other model on competitive programming and real-world software engineering tasks. It handles multi-file code generation, test creation, debugging, and code review. For practitioners running local development workflows, this means the model can function as a code assistant that understands your entire project context within its 1M token window.
Reasoning & Math: GPQA Diamond score of 90.1 places it among the top reasoning models. It handles chain-of-thought reasoning, mathematical proofs, and complex logical deductions. For local deployment, this makes it suitable for research assistance, data analysis, and technical problem-solving.
Instruction Following: DeepSeek-V4-Pro supports function calling natively, making it suitable for agentic workflows where the model needs to call tools, APIs, or execute code. The multilingual capabilities cover major languages including Chinese, English, Japanese, Korean, and European languages.
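As a sketch of what function calling looks like when the model is served behind a local OpenAI-compatible endpoint (both Ollama and llama.cpp's server expose one), the example below defines a single tool and inspects any tool calls the model emits. The endpoint URL, model tag, and the run_tests tool are placeholders for illustration.

```python
from openai import OpenAI

# Assumes a local OpenAI-compatible server, e.g. Ollama's /v1 endpoint.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",                      # hypothetical tool for illustration
        "description": "Run the project's test suite and return the failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Test directory"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-pro",                      # placeholder tag; use whatever your server reports
    messages=[{"role": "user", "content": "The auth tests are failing; investigate."}],
    tools=tools,
)

# If the model decides to call the tool, the arguments arrive as JSON to execute locally.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```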
Concrete use cases include: running a local code assistant that indexes your entire repository, building autonomous coding agents that can fix bugs and write tests without cloud dependency, processing lengthy technical documents for question-answering, and running math-heavy analytical tasks offline.
DeepSeek-V4-Pro is a large model, but the MoE architecture makes it more accessible than a 1600B dense model would be. The active 49B parameters are what determine inference memory, and quantization brings this down further.
VRAM Requirements: See the quantization table above. The recommended Q4_K_M build needs roughly 421 GB of combined GPU and system memory, rising to about 490 GB at FP16.
Hardware Options: Multi-GPU workstations and high-memory AI servers (such as the Gigabyte and SuperMicro systems in the device table above) reach roughly 13-14 tok/s, while unified-memory machines like Apple silicon can hold the weights but run well below 1 tok/s.
Getting Started: Ollama provides the quickest path to running DeepSeek-V4-Pro locally. Pull the model with `ollama pull deepseek-v4-pro`; Ollama handles quantization and hardware detection automatically. For more control, use llama.cpp directly with custom quantization settings and GPU offloading.
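Once the model is pulled, a minimal way to query it from Python is Ollama's local HTTP API; the model tag below assumes the name used in the pull command above.

```python
import requests

# Ollama listens on localhost:11434 by default; /api/chat is its native chat endpoint.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-v4-pro",               # tag from `ollama pull` above
        "messages": [
            {"role": "user", "content": "Summarize what a Mixture-of-Experts router does."}
        ],
        "stream": False,                          # return one complete response
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```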
Context Length Considerations: The 1M context window requires significant memory. At 1M tokens with Q4_K_M quantization, expect approximately 40-50 GB for the KV cache alone. Most users will run at 128K-256K context for practical local use, which keeps VRAM within single-GPU limits.
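A back-of-envelope estimate shows where figures like this come from. The layer count, KV-head count, and head dimension below are placeholder assumptions (the released config will have the real values), and the compression factor is the roughly 10x cache reduction quoted for the hybrid attention above.

```python
def kv_cache_gib(n_tokens, n_layers=80, n_kv_heads=12, head_dim=128,
                 bytes_per_elem=2, compression=0.10):
    """Rough KV-cache size: 2 (K and V) * layers * heads * head_dim bytes per token.

    The architecture numbers are placeholders, not the published config;
    `compression` applies the ~10x cache reduction of the hybrid attention.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * n_tokens * compression / 2**30

for ctx in (128_000, 256_000, 1_000_000):
    print(f"{ctx:>9,} tokens -> ~{kv_cache_gib(ctx):.1f} GiB KV cache")
```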
vs. DeepSeek-V3.2: DeepSeek-V4-Pro is a direct upgrade. The 49B active parameters (versus 37B in V3.2) and hybrid attention architecture deliver substantially better reasoning performance while using less compute at long contexts. If you're already running V3.2 locally, V4-Pro is worth the upgrade for the coding and reasoning improvements alone.
vs. Qwen3-235B-A22B: Qwen3-235B is the closest open-weight competitor at 235B total/22B active parameters. DeepSeek-V4-Pro outperforms it on coding benchmarks (93.5 vs 89.2 LiveCodeBench) and reasoning (90.1 vs 86.5 GPQA Diamond). However, Qwen3-235B requires less VRAM (12-16 GB at Q4) and runs faster on consumer hardware. Choose Qwen3-235B if you're limited to a single 24 GB GPU and need higher throughput. Choose DeepSeek-V4-Pro if you have 32 GB+ VRAM and want the best possible coding and reasoning performance from an open-weight model.
vs. GPT-5 (closed-source): DeepSeek-V4-Pro matches GPT-5 on agentic coding tasks and approaches it on general reasoning. The tradeoff is that GPT-5 runs on OpenAI's infrastructure with lower latency, while DeepSeek-V4-Pro runs locally with no API costs, no data leaving your machine, and no rate limits. For teams that need to process sensitive codebases or want predictable inference costs at scale, DeepSeek-V4-Pro is the better choice.