
State-of-the-art 685B MoE model with DeepSeek Sparse Attention and scalable RL. Gold-medal results in IMO 2025 and IOI 2025. Performance comparable to GPT-5.
A solid 685B-parameter MoE language model from DeepSeek. It pulls ahead on competition math (AIME 2026: 94/100), so reach for it when that's the dimension that matters.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Copy and paste this command to start running the model locally.

`ollama run deepseek-v3.2:cloud`

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 52.1 GB | Low |
| Q4_K_M (Recommended) | 59.8 GB | Good |
| Q5_K_M | 63.5 GB | Very Good |
| Q6_K | 68.0 GB | Excellent |
| Q8_0 | 77.2 GB | Near Perfect |
| FP16 | 112.4 GB | Full |
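As a quick sanity check, the table above can be queried programmatically: given a VRAM budget, pick the highest-quality format that fits. The sizes are copied straight from the table; real-world headroom for the KV cache and runtime overhead is not included, so treat this as a sketch rather than a sizing tool.

```python
# Quantization formats and their VRAM footprints, as listed in the table above.
FORMATS = [
    ("Q2_K", 52.1),
    ("Q4_K_M", 59.8),
    ("Q5_K_M", 63.5),
    ("Q6_K", 68.0),
    ("Q8_0", 77.2),
    ("FP16", 112.4),
]

def best_format(vram_gb):
    """Return the highest-quality format that fits in vram_gb, or None."""
    fitting = [(name, size) for name, size in FORMATS if size <= vram_gb]
    if not fitting:
        return None
    # Larger footprint == higher quality in this list, so take the biggest fit.
    return max(fitting, key=lambda f: f[1])[0]

print(best_format(80))   # Q8_0
print(best_format(60))   # Q4_K_M
```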
See which devices can run this model and at what quality level.

| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 45.1 tok/s | 59.8 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 64.6 tok/s | 59.8 GB |
| Google Cloud TPU v5p | Google | SS | 37.2 tok/s | 59.8 GB |
| Google TPU v7 (Ironwood) | Google | SS | 99.3 tok/s | 59.8 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 107.6 tok/s | 59.8 GB |
| Gigabyte W775-V10-L01 | Gigabyte | SS | 95.5 tok/s | 59.8 GB |
| SuperMicro Super AI Station | SuperMicro | SS | 95.5 tok/s | 59.8 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 27.4 tok/s | 59.8 GB |
Energy cost on an Apple M4 Pro (14-core CPU, 20-core GPU) at ~3.7 tok/s (Q4_K_M) vs flagship API pricing.

| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): DeepSeek-V3.2 on Apple M4 Pro (14-core CPU, 20-core GPU) · ~3.7 tok/s · 60W | $0.545 |
| GPT-5.5 (OpenAI) · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic) · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.1 Flash Lite Preview (Google) · in $0.250 · out $1.50 | $0.625 |
| Grok 4.3 beta (xAI) · in $3.00 · out $15.00 | $6.60 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
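Both kinds of figures above can be reproduced with a short sketch. The ~$0.121/kWh electricity price is an assumption back-solved from the local row; your actual rate will shift the local number proportionally.

```python
def local_cost_per_million_tokens(tok_per_s, watts, usd_per_kwh):
    """Energy-only cost of generating 1M tokens on local hardware."""
    hours = 1_000_000 / (tok_per_s * 3600)      # wall-clock hours for 1M tokens
    return hours * (watts / 1000) * usd_per_kwh  # kWh consumed * price

def blended_api_cost(in_price, out_price, input_share=0.70):
    """Blend per-1M-token input/output API prices at a 70/30 split."""
    return input_share * in_price + (1 - input_share) * out_price

# Apple M4 Pro row: ~3.7 tok/s at 60W, assuming ~$0.121/kWh electricity.
print(round(local_cost_per_million_tokens(3.7, 60, 0.121), 3))  # 0.545
# GPT-5.5 row: in $5.00 / out $30.00 at the 70/30 blend.
print(blended_api_cost(5.00, 30.00))  # 12.5
```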
DeepSeek-V3.2 is a state-of-the-art Mixture-of-Experts (MoE) model that represents the peak of open-weights performance as of 2025. With a total parameter count of 685B, it is designed to compete directly with frontier models like GPT-5 and Claude 3.5 Sonnet. Despite its massive total scale, the model utilizes a highly optimized MoE architecture where only 37B parameters are active during any single inference step, striking a unique balance between extreme reasoning capabilities and computational efficiency.
Developed by DeepSeek, this model is the culmination of advancements in scalable Reinforcement Learning (RL) and Sparse Attention mechanisms. It is specifically engineered for complex tasks that require high-level logical deduction, such as advanced mathematics and software engineering. Its performance in the IMO 2025 (International Mathematical Olympiad) and IOI 2025 (International Olympiad in Informatics) benchmarks places it in the top tier of reasoning models globally. For practitioners looking to run DeepSeek-V3.2 locally, the primary challenge is not the compute speed—thanks to the 37B active parameters—but the massive VRAM footprint required to house the 685B parameter weights.
DeepSeek-V3.2 utilizes a sophisticated Mixture-of-Experts (MoE) framework that differentiates it from dense models like Llama 3.1 405B. In a dense model, every parameter is activated for every token generated. In DeepSeek-V3.2, the 685B total parameters act as a vast knowledge base, but the router only engages 37B parameters per token. This DeepSeek-V3.2 MoE efficiency allows for much faster inference speeds (tokens per second) than a 600B+ dense model would otherwise permit, provided the hardware can accommodate the full model in memory.
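The routing idea described above can be sketched in a few lines of NumPy. This is an illustrative toy, not DeepSeek's implementation: the expert count, dimensions, and top-k value are made up, and a real MoE layer routes per token inside a transformer block.

```python
import numpy as np

def topk_moe_forward(x, router_w, expert_ws, k=2):
    """Route one token vector x through the top-k of n experts."""
    logits = x @ router_w                      # score every expert: shape (n_experts,)
    chosen = np.argsort(logits)[-k:]           # keep only the k best-scoring experts
    gates = np.exp(logits[chosen] - logits[chosen].max())
    gates /= gates.sum()                       # softmax over the chosen experts only
    # Only the chosen experts' weights are read; the rest stay idle, which is
    # why active parameters can be far fewer than total parameters.
    return sum(g * (x @ expert_ws[i]) for g, i in zip(gates, chosen))

rng = np.random.default_rng(0)
d, n_experts = 16, 8                           # toy sizes, not the real model's
x = rng.standard_normal(d)
router_w = rng.standard_normal((d, n_experts))
expert_ws = rng.standard_normal((n_experts, d, d))
y = topk_moe_forward(x, router_w, expert_ws, k=2)
print(y.shape)  # (16,)
```

Note that zeroing out an expert the router did not pick leaves the output unchanged, which is exactly the sparsity the 37B-active-of-685B figure describes.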
The model features DeepSeek’s proprietary Sparse Attention mechanism, which reduces the computational overhead of the self-attention layer. It supports a context length of 128,000 tokens, making it suitable for analyzing entire codebases or long-form technical documentation. However, practitioners should note that at 128k context, the KV (Key-Value) cache requirements become a significant factor in total DeepSeek-V3.2 VRAM requirements, especially when using high-precision formats.
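To see why the KV cache matters at 128k context, the standard sizing formula is 2 (keys and values) × layers × KV heads × head dim × sequence length × bytes per element. The layer and head counts below are placeholders, not DeepSeek-V3.2's actual configuration, and attention designs like DeepSeek's compress the cache well below this naive figure; the point is only the order of magnitude.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_el=2):
    """Naive KV cache size in GiB: one K and one V vector per layer/head/position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el
    return total_bytes / 2**30

# Placeholder configuration (illustrative only): 60 layers, 8 KV heads
# (GQA-style), head dim 128, full 128k context, FP16 cache entries.
print(kv_cache_gib(60, 8, 128, 131072))  # 30.0 GiB on top of the weights
```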
DeepSeek-V3.2 was trained on a massive dataset with a 2025 cutoff, ensuring it is up-to-date with the latest programming frameworks and mathematical research. The use of scalable RL (Reinforcement Learning) during the post-training phase has specifically tuned the model for "Chain of Thought" reasoning, allowing it to verify its own logic before outputting a final answer.
This is a text-only model optimized for high-logic environments. Unlike general-purpose chat models that focus on creative writing, DeepSeek-V3.2 is a "reasoning-first" engine.
Running DeepSeek-V3.2 locally requires a significant investment in hardware. As a 685B-parameter model released in 2025, it pushes the boundaries of what is possible on consumer and prosumer equipment.
To host this model, the primary bottleneck is VRAM. The 685B parameters must be loaded into memory to achieve usable performance.
For a Linux-based build, the usual GPU choices for DeepSeek-V3.2 are the NVIDIA RTX 6000 Ada (48 GB) or the RTX 3090/4090 (24 GB). Note that even the Q2_K quantization (~52 GB) exceeds any single one of these cards, so the model must be sharded across multiple GPUs.
Due to the MoE architecture, DeepSeek-V3.2 tokens per second are surprisingly high once the model is loaded. On an optimized 8x A100 setup, you can expect 20-30 tokens per second. On a consumer multi-GPU setup (4-bit), performance will likely hover between 2-5 tokens per second due to the overhead of PCIe communication between cards.
The quickest way to deploy the model is via Ollama or vLLM.
`ollama run deepseek-v3.2:685b-q4_K_M` (requires massive system RAM/VRAM)

DeepSeek-V3.2 sits in a rare class of models. Its most direct competitors are Llama 3.1 405B and Grok-1.
For practitioners, the choice to use DeepSeek-V3.2 over a smaller 70B model comes down to whether your use case requires "frontier-level" reasoning. If you are building a simple RAG (Retrieval-Augmented Generation) system, a 70B model is more cost-effective. If you are building an automated code-refactoring agent or a mathematical verification tool, the hardware investment for DeepSeek-V3.2 is justified by its leap in logical accuracy.