
Hybrid model combining V3 and R1, supporting thinking and non-thinking modes. Enhanced tool calling and agent capabilities. 128K context.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 52.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 59.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 63.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 68.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 77.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 112.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level (throughput measured at the Q4_K_M quantization, 59.8 GB):

| Device | Rating | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 | SS | 107.6 tok/s | 59.8 GB |
| NVIDIA H200 SXM 141GB | SS | 64.6 tok/s | 59.8 GB |
| NVIDIA H100 SXM5 80GB | SS | 45.1 tok/s | 59.8 GB |
| Google Cloud TPU v5p | SS | 37.2 tok/s | 59.8 GB |
| NVIDIA A100 SXM4 80GB | SS | 27.4 tok/s | 59.8 GB |
DeepSeek-V3.1 is a 671B parameter Mixture-of-Experts (MoE) model that represents the current state-of-the-art for open-weights large language models (LLMs). Developed by DeepSeek, this model is a hybrid evolution that merges the architectural strengths of DeepSeek-V3 with the advanced reasoning capabilities of the DeepSeek-R1 series. It is designed to compete directly with frontier models like GPT-4o and Llama 3.1 405B, offering a versatile platform for developers who require high-tier performance across coding, mathematics, and complex instruction-following without relying on closed-source APIs.
DeepSeek-V3.1 stands out among locally runnable models because it supports both "thinking" (chain-of-thought) and "non-thinking" modes in a single set of weights. This allows practitioners to deploy one model for both rapid-fire chat applications and deep, multi-step reasoning tasks. While the total parameter count is massive, the model's MoE architecture keeps inference computationally efficient, activating only a fraction of its total weights for any given token.
The core of DeepSeek-V3.1 is its sparse Mixture-of-Experts (MoE) architecture. Out of the 671B total parameters, only 37B are active during the forward pass for any single token. This design choice is critical for local practitioners because it decouples the model's knowledge capacity from its compute requirements. While you need enough VRAM to store the full 671B parameters, the DeepSeek-V3.1 MoE efficiency allows it to generate text at speeds comparable to much smaller dense models (like a 30B-40B parameter model).
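The storage-versus-streaming split above can be sketched with some back-of-the-envelope arithmetic. This is a rough illustration only: it counts weights alone (no KV cache or runtime overhead) and assumes Q4_K_M averages roughly 4.5 bits per weight, which is an approximation, not an official figure.

```python
# Why a 671B MoE can decode like a much smaller dense model: the full
# parameter set must sit in memory, but each token only reads the weights
# of the routed experts (~37B parameters).
TOTAL_PARAMS = 671e9   # all experts; must fit in VRAM/RAM
ACTIVE_PARAMS = 37e9   # parameters used per forward pass

def gib(n_params, bits_per_weight):
    """Weights-only memory in GiB at the given (approximate) quantized width."""
    return n_params * bits_per_weight / 8 / 2**30

storage = gib(TOTAL_PARAMS, 4.5)      # full model at ~Q4_K_M widths
per_token = gib(ACTIVE_PARAMS, 4.5)   # weights streamed per generated token

print(f"stored:          {storage:7.1f} GiB")
print(f"read per token:  {per_token:7.1f} GiB")
print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The active fraction (about 5.5%) is why decode throughput resembles a 30B-40B dense model even though the memory footprint does not.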
Key architectural features include:

- **Sparse MoE routing:** 671B total parameters, with only 37B activated per token.
- **Multi-head Latent Attention (MLA):** compresses the KV cache, keeping long-context inference memory-efficient.
- **Hybrid reasoning modes:** thinking (chain-of-thought) and non-thinking generation from the same weights, selected via the chat template.
- **128K context window.**
DeepSeek-V3.1 is a text-only model, but its breadth of capability rivals the most expensive proprietary models. It is specifically optimized for environments where precision and logic are paramount.
Coding is one of the most common deployments for this model. DeepSeek-V3.1 excels at code generation, debugging, and large, multi-file refactoring tasks.
The model features enhanced tool-calling capabilities, making it an ideal "brain" for AI agents. It can reliably format JSON, call external APIs, and follow strict schemas. Unlike smaller models that often hallucinate function arguments, V3.1 maintains high reliability in multi-turn tool use, which is essential for autonomous local agents.
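To make the schema-following point concrete, here is a minimal sketch of the OpenAI-compatible "tools" format that most local servers accept, plus the kind of validation guard an agent loop would run on the model's output. The `get_weather` tool and its parameters are hypothetical examples, not part of any DeepSeek API.

```python
import json

# Hypothetical tool definition in the widely used OpenAI-compatible format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # example tool, invented for illustration
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def validate_call(raw_arguments: str, schema: dict) -> dict:
    """Parse a tool call's arguments string and check required keys.

    A reliable model should always pass this; the guard protects an
    autonomous agent against the occasional malformed or incomplete call.
    """
    args = json.loads(raw_arguments)  # raises ValueError on invalid JSON
    missing = [k for k in schema["required"] if k not in args]
    if missing:
        raise ValueError(f"tool call missing required keys: {missing}")
    return args

# A well-formed arguments string, as the model would emit it:
args = validate_call('{"city": "Berlin"}', tools[0]["function"]["parameters"])
print(args)
```

In a multi-turn agent, the validated arguments would be dispatched to the real function and the result appended to the conversation as a tool message.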
DeepSeek-V3.1 is highly proficient in multilingual environments, specifically in English and Chinese, but it also demonstrates strong performance in major European and Asian languages. Its reasoning benchmark scores in mathematics (AIME, MATH) are among the highest for open-weights models, making it a viable tool for scientific research and data analysis.
Running a 671B parameter model locally is a significant hardware challenge. The primary bottleneck is not the compute (FLOPs), but the memory (VRAM/RAM). To run DeepSeek-V3.1 locally, you must account for the massive footprint of the model weights.
The amount of memory required depends entirely on the quantization level. Using the standard BF16 (16-bit) format is impossible for almost all local setups, as it requires over 1.3TB of VRAM. Quantization is mandatory for local practitioners.
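The weights-only footprint can be estimated directly from the parameter count. The bits-per-weight values below are rough effective averages for llama.cpp-style quants (assumptions, not exact figures), and the totals exclude KV cache and runtime overhead.

```python
# Weights-only footprint of the full 671B model at common quantization widths.
PARAMS = 671e9

quants = {          # approximate effective bits per weight (assumed)
    "BF16":   16.0,
    "Q8_0":    8.5,
    "Q4_K_M":  4.8,
    "Q2_K":    2.6,
}

for name, bits in quants.items():
    gb = PARAMS * bits / 8 / 1e9  # decimal gigabytes
    print(f"{name:7s} ~{gb:6.0f} GB")
```

The BF16 row works out to roughly 1.34 TB for the weights alone, which is why quantization is non-negotiable for local deployment.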
If you are looking for the best GPU for DeepSeek-V3.1, you generally need a multi-GPU array or a high-memory Unified Memory system.
DeepSeek-V3.1 tokens-per-second (t/s) throughput will vary with your hardware's memory bandwidth. On a multi-4090 setup using llama.cpp, you can expect between 5 and 15 tokens per second, depending on the quantization level and context usage.
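The bandwidth dependence follows from the fact that decoding is memory-bound: each token must stream the active experts' weights from memory, so bandwidth divided by bytes-per-token gives a hard ceiling. The sketch below uses the RTX 4090's ~1008 GB/s spec bandwidth and an assumed ~4.8 effective bits per weight; real multi-GPU llama.cpp runs land well below this ceiling because of PCIe transfers, expert-routing overhead, and the KV cache.

```python
# Memory-bandwidth ceiling on decode speed for a MoE model:
#   tok/s  <=  bandwidth / bytes_of_active_weights_per_token
ACTIVE_PARAMS = 37e9  # parameters read per generated token

def decode_ceiling(bandwidth_gb_s: float, bits_per_weight: float) -> float:
    """Upper-bound tokens/second if weight streaming were the only cost."""
    bytes_per_token = ACTIVE_PARAMS * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Single RTX 4090 (~1008 GB/s), ~Q4_K_M widths (both approximate):
print(f"ideal ceiling: {decode_ceiling(1008, 4.8):.0f} tok/s")
```

The gap between this ideal ceiling (tens of tokens per second) and the observed 5-15 t/s is the cost of splitting the model across GPUs and system RAM.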
Ollama is the quickest way to get started. Once you have the necessary hardware, you can run:

```shell
ollama run deepseek-v3.1:latest
```

(Note: ensure you have selected a tag that fits your VRAM, such as the `iq2_xs` or `q4_k_m` variants.)
When evaluating DeepSeek-V3.1 hardware requirements and performance, it is helpful to compare it against its closest competitors in the open-weights space.
For practitioners who can solve the puzzle of running a 671B model on consumer hardware (typically through large multi-GPU nodes or high-RAM Mac systems), DeepSeek-V3.1 provides a level of local intelligence that was previously available only via cloud-based API providers.