
671B MoE reasoning model matching OpenAI o1 on math/coding. Uses RL to develop chain-of-thought reasoning. Caused global market shock on release. MIT licensed.
Copy and paste this command to start running the model locally:

`ollama run deepseek-r1`
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 52.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 59.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 63.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 68.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 77.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 112.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 | SS | 107.6 tok/s | 59.8 GB |
| NVIDIA H200 SXM 141GB | SS | 64.6 tok/s | 59.8 GB |
| NVIDIA H100 SXM5 80GB | SS | 45.1 tok/s | 59.8 GB |
| Google Cloud TPU v5p | SS | 37.2 tok/s | 59.8 GB |
| NVIDIA A100 SXM4 80GB | SS | 27.4 tok/s | 59.8 GB |
DeepSeek-R1 is a 671B parameter Mixture-of-Experts (MoE) model that serves as the first open-weight alternative to proprietary reasoning models like OpenAI’s o1. Developed by DeepSeek, the model uses large-scale Reinforcement Learning (RL) to achieve state-of-the-art performance in mathematics, code generation, and complex logic. Unlike standard instruction-tuned models, R1 is designed to "think" before it speaks, generating a visible chain-of-thought (CoT) that allows it to self-correct and navigate multi-step problems.
The release of DeepSeek-R1 caused a significant shift in the local AI landscape by proving that a model with an MIT license could match or exceed the performance of the world’s most expensive closed-source APIs. For practitioners, the 671B model represents the current ceiling for local inference, requiring specialized hardware configurations to handle its massive memory footprint. While its total parameter count is high, its MoE architecture ensures that it remains computationally efficient during inference, activating only a fraction of its weights for any given token.
The core of DeepSeek-R1 is a Mixture-of-Experts (MoE) architecture. While the model contains 671 billion total parameters, it uses only 37 billion active parameters per token. This efficiency is critical for local practitioners because it decouples the memory requirement from the compute requirement: you need enough VRAM to hold all 671B parameters, but the actual processing cost (and thus the speed) is closer to that of a 37B dense model.
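The compute side of that trade-off can be sketched with a common rule of thumb (roughly 2 FLOPs per parameter per generated token); the figures below are the 671B/37B numbers from the text, and the rule of thumb is an approximation, not a measured cost:

```python
# Rough per-token compute for a MoE model vs. an equally sized dense model.
# Rule of thumb: ~2 FLOPs per parameter per token for a forward pass.
# Memory scales with TOTAL params; compute scales with ACTIVE params.

def per_token_gflops(active_params_b: float) -> float:
    """Approximate forward-pass compute per token, in GFLOPs."""
    return 2 * active_params_b  # params given in billions

TOTAL_B, ACTIVE_B = 671, 37  # DeepSeek-R1 figures from the text

moe_cost = per_token_gflops(ACTIVE_B)    # MoE: only active experts run
dense_cost = per_token_gflops(TOTAL_B)   # what a 671B dense model would cost

print(f"MoE per-token compute: {moe_cost:.0f} GFLOPs")
print(f"Dense equivalent:      {dense_cost:.0f} GFLOPs")
print(f"Per-token compute is   {dense_cost / moe_cost:.1f}x cheaper than dense")
```

This is why the table above shows interactive speeds on a single high-bandwidth accelerator despite the enormous total parameter count.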
Technical specifications include:

- Total parameters: 671B (Mixture-of-Experts)
- Active parameters per token: 37B
- Context window: 128K tokens
- License: MIT
The 128k context window allows for massive codebases or long-form documents to be ingested, though the VRAM requirements for KV cache at this length are substantial. The model uses a unique training recipe where the "reasoning" capability was incentivized through RL rather than just supervised fine-tuning (SFT), leading to the emergence of sophisticated logic patterns and the ability to handle high-level math and programming tasks.
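To see why the KV cache dominates at long contexts, here is a back-of-the-envelope sizing sketch. The layer count and head dimensions below are hypothetical placeholders, and DeepSeek-R1 actually uses Multi-head Latent Attention (MLA), which compresses the cache well below what this standard-attention formula predicts — treat this as an illustration of the scaling, not a spec:

```python
# Back-of-the-envelope KV-cache sizing for long contexts.
# NOTE: the 61-layer / 8-head / 128-dim config below is a HYPOTHETICAL
# placeholder; DeepSeek-R1's MLA attention compresses its KV cache, so the
# real footprint is smaller than this standard-attention estimate.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, bytes_per_elem: int = 2) -> float:
    """Standard attention stores one K and one V vector per layer, per position."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem
    return total_bytes / 1e9

# FP16 cache at increasing context lengths, up to the full 128K window:
for seq_len in (8_192, 32_768, 131_072):
    print(f"{seq_len:>7} tokens -> {kv_cache_gb(61, 8, 128, seq_len):.1f} GB")
```

The linear growth with sequence length is the key takeaway: filling the full 128K window can add tens of gigabytes on top of the weights.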
DeepSeek-R1 is specifically optimized for tasks where accuracy and logic are more important than creative flair. It excels in environments where the model must verify its own work or follow strict logical constraints.
On the DeepSeek-R1 reasoning benchmark results, the model consistently matches o1-preview levels. It is capable of solving competitive-level math problems (AIME, MATH) and providing step-by-step proofs. For local users, this makes it an ideal tool for verifying scientific papers, solving engineering equations, or debugging complex logic gates.
The model is a top-tier choice for software engineering. It doesn't just suggest snippets; it can architect entire modules and explain the trade-offs between different implementations. Because it uses a chain-of-thought process, it is significantly better at catching edge cases in C++, Python, and Rust than dense models like Llama 3.1 70B.
Despite its focus on reasoning, R1 is a highly capable general-purpose assistant. It follows complex system prompts with high fidelity and can handle multi-turn conversations without losing context. Its MIT license makes it a primary candidate for developers building commercial applications that require a local, high-reasoning backbone.
Running a 671B model locally is a significant hardware challenge. The DeepSeek-R1 VRAM requirements are the primary hurdle for most engineers. To run DeepSeek-R1 locally, you must account for the weights and the KV cache.
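A minimal budgeting check can combine the weight sizes from the quantization table above with an allowance for the KV cache and runtime overhead. The weight figures come from that table; the KV-cache and overhead defaults here are illustrative assumptions, not measured values:

```python
# Minimal memory-budget check: weights + KV cache + runtime overhead must fit.
# Weight sizes are taken from the quantization table above; the default
# kv_cache_gb and overhead_gb values are illustrative ASSUMPTIONS.

WEIGHTS_GB = {
    "Q2_K": 52.1, "Q4_K_M": 59.8, "Q5_K_M": 63.5,
    "Q6_K": 68.0, "Q8_0": 77.2, "FP16": 112.4,
}

def fits(vram_gb: float, quant: str = "Q4_K_M",
         kv_cache_gb: float = 8.0, overhead_gb: float = 2.0):
    """Return (fits, total_needed_gb) for a given VRAM budget and quant."""
    needed = WEIGHTS_GB[quant] + kv_cache_gb + overhead_gb
    return needed <= vram_gb, needed

ok, needed = fits(80, "Q4_K_M")  # e.g. a single 80 GB H100 or A100
print(f"Need {needed:.1f} GB -> {'fits' if ok else 'does not fit'} in 80 GB")
```

Remember that the KV-cache term grows with context length, so a budget that fits at 8K tokens may not fit at 128K.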
To run the full 671B model, consumer hardware is generally insufficient unless multiple GPUs are combined into a cluster.
For most practitioners, Q4_K_M is the recommended quantization for balancing intelligence and memory. If you are limited by hardware, IQ4_XS or Q3_K_L offer a workable middle ground. Avoid going below Q2_K, as the reasoning capabilities—the model's primary selling point—begin to collapse at extremely low bitrates.
The DeepSeek-R1 tokens per second (t/s) will vary wildly based on your memory bandwidth.
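A quick way to see why bandwidth dominates: during decoding, every generated token must stream the active weights through the accelerator at least once, so memory bandwidth divided by the byte size of the active weights gives a hard ceiling on tok/s. The bandwidth and bits-per-weight figures below are rough assumptions, and real throughput lands well below this ceiling (attention, KV reads, kernel overheads):

```python
# Naive upper bound on decode speed for a memory-bandwidth-bound model:
# tok/s <= bandwidth / bytes_of_active_weights. Treat this as a ceiling,
# not a prediction; measured speeds are substantially lower.

def max_tokens_per_sec(bandwidth_gbps: float, active_params_b: float,
                       bits_per_weight: float) -> float:
    active_gb = active_params_b * bits_per_weight / 8  # GB streamed per token
    return bandwidth_gbps / active_gb

# 37B active params at ~4.5 bits/weight (Q4_K_M-class),
# on a GPU with ~3350 GB/s of memory bandwidth (roughly H100-class):
print(f"Ceiling: {max_tokens_per_sec(3350, 37, 4.5):.0f} tok/s")
```

Comparing this ceiling with the measured figures in the device table above shows how much real-world overhead costs.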
If you do not have 400GB of VRAM, you cannot run the full 671B model effectively. However, DeepSeek has released "distilled" versions of R1 ranging from 1.5B to 70B parameters. For a single RTX 4090, the DeepSeek-R1-Distill-Llama-70B is the best choice, providing high-level reasoning within a 24GB-48GB VRAM envelope (using 4-bit quantization).
To get started quickly, Ollama is the most efficient path. Use the command `ollama run deepseek-r1:671b` (if you have the hardware) or `ollama run deepseek-r1:70b` for high-end consumer setups.
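Once the Ollama server is running, it can be driven programmatically. A minimal stdlib-only sketch against Ollama's `/api/generate` endpoint on its default port is shown below; the model tag and prompt are placeholders you would swap for your own:

```python
# Calling a locally running Ollama server from Python (stdlib only).
# Assumes the model has been pulled (e.g. `ollama run deepseek-r1:70b`)
# and the server is listening on the default port 11434.
import json
import urllib.request

def build_request(model: str, prompt: str) -> dict:
    """Payload for Ollama's /api/generate endpoint (non-streaming)."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str,
             host: str = "http://localhost:11434") -> str:
    payload = json.dumps(build_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    # Example prompt; the visible chain-of-thought appears in the response.
    print(generate("deepseek-r1:70b", "Prove that sqrt(2) is irrational."))
```

Non-streaming mode is the simplest to script; set `"stream": True` and read line-delimited JSON if you want tokens as they are generated.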
When evaluating DeepSeek-R1 vs Llama 3.1 405B, the primary difference is architecture and intent: R1 is a sparse MoE reasoning model that activates only 37B of its 671B parameters per token and emits an explicit chain of thought, while Llama 3.1 405B is a dense model that computes all 405B parameters for every token and answers directly.
The best GPU for DeepSeek-R1 depends on your budget. For the full 671B model, a Mac Studio with 192GB+ Unified Memory is the most cost-effective single-device solution, while a cluster of RTX 6000 Ada or A100 80GB cards remains the gold standard for production-grade local inference.