
671B MoE reasoning model matching OpenAI o1 on math/coding. Uses RL to develop chain-of-thought reasoning. Caused global market shock on release. MIT licensed.
Copy and paste this command to start running the model locally:

`ollama run deepseek-r1`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 52.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 59.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 63.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 68.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 77.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 112.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 | SS | 107.6 tok/s | 59.8 GB |
| NVIDIA H200 SXM 141GB | SS | 64.6 tok/s | 59.8 GB |
| NVIDIA H100 SXM5 80GB | SS | 45.1 tok/s | 59.8 GB |
| Google Cloud TPU v5p | SS | 37.2 tok/s | 59.8 GB |
| NVIDIA A100 SXM4 80GB | SS | 27.4 tok/s | 59.8 GB |
DeepSeek-R1 is a 671B parameter Mixture-of-Experts (MoE) model that serves as the first open-weight alternative to proprietary reasoning models like OpenAI’s o1. Developed by DeepSeek, the model uses large-scale Reinforcement Learning (RL) to achieve state-of-the-art performance in mathematics, code generation, and complex logic. Unlike standard instruction-tuned models, R1 is designed to "think" before it speaks, generating a visible chain-of-thought (CoT) that allows it to self-correct and navigate multi-step problems.
The release of DeepSeek-R1 caused a significant shift in the local AI landscape by proving that a model with an MIT license could match or exceed the performance of the world’s most expensive closed-source APIs. For practitioners, the 671B model represents the current ceiling for local inference, requiring specialized hardware configurations to handle its massive memory footprint. While its total parameter count is high, its MoE architecture ensures that it remains computationally efficient during inference, activating only a fraction of its weights for any given token.
The core of DeepSeek-R1 is a Mixture-of-Experts (MoE) architecture. While the model contains 671 billion total parameters, it uses only 37 billion active parameters per token. This efficiency is critical for local practitioners because it decouples the memory requirement from the compute requirement: you need enough VRAM to hold all 671B parameters, but the actual processing cost (and thus the speed) is closer to that of a 37B dense model.
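The compute side of that trade-off can be sketched with a common rule of thumb (roughly 2 FLOPs per parameter per generated token); the figures below are the 671B/37B numbers from the text, and the rule of thumb is an approximation, not a measured cost:

```python
# Rough per-token compute for a MoE model vs. an equally sized dense model.
# Rule of thumb: ~2 FLOPs per parameter per token for a forward pass.
# Memory scales with TOTAL params; compute scales with ACTIVE params.

def per_token_gflops(active_params_b: float) -> float:
    """Approximate forward-pass compute per token, in GFLOPs."""
    return 2 * active_params_b  # params given in billions

TOTAL_B, ACTIVE_B = 671, 37  # DeepSeek-R1 figures from the text

moe_cost = per_token_gflops(ACTIVE_B)    # MoE: only active experts run
dense_cost = per_token_gflops(TOTAL_B)   # what a 671B dense model would cost

print(f"MoE per-token compute: {moe_cost:.0f} GFLOPs")
print(f"Dense equivalent:      {dense_cost:.0f} GFLOPs")
print(f"Per-token compute is   {dense_cost / moe_cost:.1f}x cheaper than dense")
```

This is why the table above shows interactive speeds on a single high-bandwidth accelerator despite the enormous total parameter count.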
Technical specifications include:

- Total parameters: 671B (Mixture-of-Experts)
- Active parameters per token: 37B
- Context window: 128K tokens
- License: MIT
The 128k context window allows for massive codebases or long-form documents to be ingested, though the VRAM requirements for KV cache at this length are substantial. The model uses a unique training recipe where the "reasoning" capability was incentivized through RL rather than just supervised fine-tuning (SFT), leading to the emergence of sophisticated logic patterns and the ability to handle high-level math and programming tasks.
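To see why the KV cache dominates at long contexts, here is a back-of-the-envelope sizing sketch. The layer count and head dimensions below are hypothetical placeholders, and DeepSeek-R1 actually uses Multi-head Latent Attention (MLA), which compresses the cache well below what this standard-attention formula predicts — treat this as an illustration of the scaling, not a spec:

```python
# Back-of-the-envelope KV-cache sizing for long contexts.
# NOTE: the 61-layer / 8-head / 128-dim config below is a HYPOTHETICAL
# placeholder; DeepSeek-R1's MLA attention compresses its KV cache, so the
# real footprint is smaller than this standard-attention estimate.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Standard attention stores one K and one V vector per layer, per position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 1e9

# FP16 cache at increasing context lengths, up to the full 128K window:
for seq_len in (8_192, 32_768, 131_072):
    print(f"{seq_len:>7} tokens -> {kv_cache_gb(61, 8, 128, seq_len):.1f} GB")
```

The linear growth with sequence length is the key takeaway: filling the full 128K window can add tens of gigabytes on top of the weights.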
DeepSeek-R1 is specifically optimized for tasks where accuracy and logic are more important than creative flair. It excels in environments where the model must verify its own work or follow strict logical constraints.
On the DeepSeek-R1 reasoning benchmark results, the model consistently matches o1-preview levels. It is capable of solving competitive-level math problems (AIME, MATH) and providing step-by-step proofs. For local users, this makes it an ideal tool for verifying scientific papers, solving engineering equations, or debugging complex logic gates.
The model is a top-tier choice for software engineering. It doesn't just suggest snippets; it can architect entire modules and explain the trade-offs between different implementations. Because it uses a chain-of-thought process, it is significantly better at catching edge cases in C++, Python, and Rust than dense models like Llama 3.1 70B.
Despite its focus on reasoning, R1 is a highly capable general-purpose assistant. It follows complex system prompts with high fidelity and can handle multi-turn conversations without losing context. Its MIT license makes it a primary candidate for developers building commercial applications that require a local, high-reasoning backbone.
Running a 671B model locally is a significant hardware challenge. The DeepSeek-R1 VRAM requirements are the primary hurdle for most engineers. To run DeepSeek-R1 locally, you must account for the weights and the KV cache.
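A minimal budgeting check can combine the weight sizes from the quantization table above with an allowance for the KV cache and runtime overhead. The weight figures come from that table; the KV-cache and overhead defaults here are illustrative assumptions, not measured values:

```python
# Minimal memory-budget check: weights + KV cache + runtime overhead must fit.
# Weight sizes are taken from the quantization table above; the default
# kv_cache_gb and overhead_gb values are illustrative ASSUMPTIONS.

WEIGHTS_GB = {
    "Q2_K": 52.1, "Q4_K_M": 59.8, "Q5_K_M": 63.5,
    "Q6_K": 68.0, "Q8_0": 77.2, "FP16": 112.4,
}

def fits(vram_gb: float, quant: str = "Q4_K_M",
         kv_cache_gb: float = 8.0, overhead_gb: float = 2.0):
    """Return (fits, total_needed_gb) for a given VRAM budget and quant."""
    needed = WEIGHTS_GB[quant] + kv_cache_gb + overhead_gb
    return needed <= vram_gb, needed

ok, needed = fits(80, "Q4_K_M")  # e.g. a single 80 GB H100 or A100
print(f"Need {needed:.1f} GB -> {'fits' if ok else 'does not fit'} in 80 GB")
```

Remember that the KV-cache term grows with context length, so a budget that fits at 8K tokens may not fit at 128K.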
To run the full 671B model, consumer hardware is generally insufficient unless multiple GPUs are combined into a cluster.
For most practitioners, Q4_K_M is the recommended quantization for balancing intelligence and memory. If you are limited by hardware, IQ4_XS or Q3_K_L offer a workable middle ground. Avoid going below Q2_K, as the reasoning capabilities—the model's primary selling point—begin to collapse at extremely low bitrates.
The DeepSeek-R1 tokens per second (t/s) will vary wildly based on your memory bandwidth.
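A quick way to see why bandwidth dominates: during decoding, every generated token must stream the active weights through the accelerator at least once, so memory bandwidth divided by the byte size of the active weights gives a hard ceiling on tok/s. The bandwidth and bits-per-weight figures below are rough assumptions, and real throughput lands well below this ceiling (attention, KV reads, kernel overheads):

```python
# Naive upper bound on decode speed for a memory-bandwidth-bound model:
# tok/s <= bandwidth / bytes_of_active_weights. Treat this as a ceiling,
# not a prediction; measured speeds are substantially lower.

def max_tokens_per_sec(bandwidth_gbps: float, active_params_b: float,
                       bits_per_weight: float) -> float:
    active_gb = active_params_b * bits_per_weight / 8  # GB streamed per token
    return bandwidth_gbps / active_gb

# 37B active params at ~4.5 bits/weight (Q4_K_M-class),
# on a GPU with ~3350 GB/s of memory bandwidth (roughly H100-class):
print(f"Ceiling: {max_tokens_per_sec(3350, 37, 4.5):.0f} tok/s")
```

Comparing this ceiling with the measured figures in the device table above shows how much real-world overhead costs.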
If you do not have 400GB of VRAM, you cannot run the full 671B model effectively. However, DeepSeek has released "distilled" versions of R1 ranging from 1.5B to 70B parameters. For a single RTX 4090, the DeepSeek-R1-Distill-Llama-70B is the best choice, providing high-level reasoning within a 24GB-48GB VRAM envelope (using 4-bit quantization).
To get started quickly, Ollama is the most efficient path. Use the command `ollama run deepseek-r1:671b` (if you have the hardware) or `ollama run deepseek-r1:70b` for high-end consumer setups.
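Once the Ollama server is running, it can be driven programmatically. A minimal stdlib-only sketch against Ollama's `/api/generate` endpoint on its default port is shown below; the model tag and prompt are placeholders you would swap for your own:

```python
# Calling a locally running Ollama server from Python (stdlib only).
# Assumes the model has been pulled (e.g. `ollama run deepseek-r1:70b`)
# and the server is listening on the default port 11434.
import json
import urllib.request

def build_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Example prompt; the visible chain-of-thought appears in the response.
    print(generate("deepseek-r1:70b", "Prove that sqrt(2) is irrational."))
```

Non-streaming mode is the simplest to script; set `"stream": True` and read line-delimited JSON if you want tokens as they are generated.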
When evaluating DeepSeek-R1 vs Llama 3.1 405B, the primary difference is architecture and intent: R1 is a sparse MoE reasoning model that activates only 37B of its 671B parameters per token and emits an explicit chain of thought, while Llama 3.1 405B is a dense model that computes all 405B parameters for every token and answers directly.
The best GPU for DeepSeek-R1 depends on your budget. For the full 671B model, a Mac Studio with 192GB+ Unified Memory is the most cost-effective single-device solution, while a cluster of RTX 6000 Ada or A100 80GB cards remains the gold standard for production-grade local inference.