
State-of-the-art 685B-parameter MoE model with DeepSeek Sparse Attention and scalable RL. Gold-medal performance at IMO 2025 and IOI 2025; comparable to GPT-5.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 52.1 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 59.8 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 63.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 68.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 77.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 112.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H100 SXM5 80GB | S | 45.1 tok/s | 59.8 GB |
| NVIDIA H200 SXM 141GB | S | 64.6 tok/s | 59.8 GB |
| Google Cloud TPU v5p | S | 37.2 tok/s | 59.8 GB |
| NVIDIA B200 GPU | S | 107.6 tok/s | 59.8 GB |
| NVIDIA A100 SXM4 80GB | S | 27.4 tok/s | 59.8 GB |

Lower-tier configurations (grades A through C in the source listing) reported between 3.7 and 11.0 tok/s at the same 59.8 GB footprint.
DeepSeek-V3.2 is a state-of-the-art Mixture-of-Experts (MoE) model that represents the peak of open-weights performance as of 2025. With a total parameter count of 685B, it is designed to compete directly with frontier models like GPT-5 and Claude 3.5 Sonnet. Despite its massive total scale, the model utilizes a highly optimized MoE architecture where only 37B parameters are active during any single inference step, striking a unique balance between extreme reasoning capabilities and computational efficiency.
Developed by DeepSeek, this model is the culmination of advancements in scalable Reinforcement Learning (RL) and Sparse Attention mechanisms. It is specifically engineered for complex tasks that require high-level logical deduction, such as advanced mathematics and software engineering. Its performance in the IMO 2025 (International Mathematical Olympiad) and IOI 2025 (International Olympiad in Informatics) benchmarks places it in the top tier of reasoning models globally. For practitioners looking to run DeepSeek-V3.2 locally, the primary challenge is not the compute speed—thanks to the 37B active parameters—but the massive VRAM footprint required to house the 685B parameter weights.
DeepSeek-V3.2 utilizes a sophisticated Mixture-of-Experts (MoE) framework that differentiates it from dense models like Llama 3.1 405B. In a dense model, every parameter is activated for every token generated. In DeepSeek-V3.2, the 685B total parameters act as a vast knowledge base, but the router only engages 37B parameters per token. This DeepSeek-V3.2 MoE efficiency allows for much faster inference speeds (tokens per second) than a 600B+ dense model would otherwise permit, provided the hardware can accommodate the full model in memory.
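To make the routing idea concrete, here is a minimal top-k gating sketch in Python. This illustrates generic MoE routing only, not DeepSeek's actual router (which adds load balancing and shared experts); the expert count and logits below are made up.

```python
import math

def route_token(router_logits, top_k=2):
    """Generic MoE gating: pick the top_k experts for one token and
    softmax-normalize their weights. Only those experts' parameters
    are then used in the forward pass for this token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# One token scored against 8 hypothetical experts:
weights = route_token([0.1, 2.3, -1.0, 0.7, 1.9, -0.2, 0.0, 0.5], top_k=2)
# Experts 1 and 4 win; their gate weights sum to 1.
```

The key property is that the cost per token scales with `top_k` experts, not with the total expert count, which is why a 685B-total model can decode at the speed of a much smaller dense one.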
The model features DeepSeek’s proprietary Sparse Attention mechanism, which reduces the computational overhead of the self-attention layer. It supports a context length of 128,000 tokens, making it suitable for analyzing entire codebases or long-form technical documentation. However, practitioners should note that at 128k context, the KV (Key-Value) cache requirements become a significant factor in total DeepSeek-V3.2 VRAM requirements, especially when using high-precision formats.
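The KV-cache pressure at long context is easy to see with back-of-the-envelope arithmetic. The sketch below sizes a plain, uncompressed multi-head attention cache using illustrative layer and head counts; these are not DeepSeek-V3.2's actual configuration, whose attention design compresses the cache substantially.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of a standard (uncompressed) KV cache: keys + values for
    every layer, head, and position, at the given element width."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Illustrative config: 61 layers, 128 KV heads of dim 128, FP16, full 128k context
full_ctx = kv_cache_gib(61, 128, 128, 128_000)
print(f"{full_ctx:.0f} GiB")  # hundreds of GiB for a naive cache at this scale
```

Even with made-up but plausible dimensions, a naive FP16 cache at 128k context dwarfs the quantized weights themselves, which is exactly the overhead that sparse and compressed attention schemes target.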
DeepSeek-V3.2 was trained on a massive dataset with a 2025 cutoff, ensuring it is up-to-date with the latest programming frameworks and mathematical research. The use of scalable RL (Reinforcement Learning) during the post-training phase has specifically tuned the model for "Chain of Thought" reasoning, allowing it to verify its own logic before outputting a final answer.
This is a text-only model optimized for high-logic environments. Unlike general-purpose chat models that focus on creative writing, DeepSeek-V3.2 is a "reasoning-first" engine.
Attempting to run DeepSeek-V3.2 locally requires a significant investment in hardware. As a 685B-parameter model released in 2025, it pushes the boundaries of what is possible on consumer and prosumer equipment.
To host this model, the primary bottleneck is VRAM. The 685B parameters must be loaded into memory to achieve usable performance.
For a Linux-based build, common GPU choices for DeepSeek-V3.2 are the NVIDIA RTX 6000 Ada (48 GB) or the RTX 3090/4090 (24 GB); note that even at Q4_K_M (~59.8 GB), a single card is not enough, so plan on at least two 48 GB cards or three 24 GB cards to hold the weights.
Due to the MoE architecture, DeepSeek-V3.2 tokens per second are surprisingly high once the model is loaded. On an optimized 8x A100 setup, you can expect 20-30 tokens per second. On a consumer multi-GPU setup (4-bit), performance will likely hover between 2-5 tokens per second due to the overhead of PCIe communication between cards.
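These throughput figures follow from a memory-bandwidth argument: during decoding, every active weight must be streamed from memory once per token, so bandwidth divided by active bytes gives a hard ceiling. The sketch below estimates that ceiling; the bandwidth and bits-per-weight figures are rough assumptions, not measurements.

```python
def decode_toks_upper_bound(active_params_b, bits_per_weight, mem_bw_gb_s):
    """Bandwidth-bound ceiling on decode speed: tokens/s is at most
    memory bandwidth divided by bytes of active weights per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return mem_bw_gb_s * 1e9 / bytes_per_token

# 37B active params at ~4.5 bits/weight on an H100-class card (~3350 GB/s assumed)
ceiling = decode_toks_upper_bound(37, 4.5, 3350)
# Real-world numbers land well below this ceiling once attention,
# KV-cache reads, and kernel/PCIe overhead are accounted for.
```

The same formula explains why multi-GPU consumer rigs fare worse: once weights are split across cards, inter-GPU transfers over PCIe, not raw VRAM bandwidth, become the binding constraint.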
The quickest way to deploy the model is via Ollama or vLLM.
Run `ollama run deepseek-v3.2:685b-q4_K_M` (requires massive system RAM/VRAM).

DeepSeek-V3.2 sits in a rare class of models. Its most direct competitors are Llama 3.1 405B and Grok-1.
For practitioners, the choice to use DeepSeek-V3.2 over a smaller 70B model comes down to whether your use case requires "frontier-level" reasoning. If you are building a simple RAG (Retrieval-Augmented Generation) system, a 70B model is more cost-effective. If you are building an automated code-refactoring agent or a mathematical verification tool, the hardware investment for DeepSeek-V3.2 is justified by its leap in logical accuracy.