
Hybrid model combining V3 and R1, supporting thinking and non-thinking modes. Enhanced tool calling and agent capabilities. 128K context.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 52.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 59.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 63.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 68.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 77.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 112.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level (throughput measured at the Q4_K_M quantization, 59.8 GB):

| Device | Rating | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA B200 | SS | 107.6 tok/s | 59.8 GB |
| NVIDIA H200 SXM 141GB | SS | 64.6 tok/s | 59.8 GB |
| NVIDIA H100 SXM5 80GB | SS | 45.1 tok/s | 59.8 GB |
| Google Cloud TPU v5p | SS | 37.2 tok/s | 59.8 GB |
| NVIDIA A100 SXM4 80GB | SS | 27.4 tok/s | 59.8 GB |
DeepSeek-V3.1 is a 671B parameter Mixture-of-Experts (MoE) model that represents the current state-of-the-art for open-weights large language models (LLMs). Developed by DeepSeek, this model is a hybrid evolution that merges the architectural strengths of DeepSeek-V3 with the advanced reasoning capabilities of the DeepSeek-R1 series. It is designed to compete directly with frontier models like GPT-4o and Llama 3.1 405B, offering a versatile platform for developers who require high-tier performance across coding, mathematics, and complex instruction-following without relying on closed-source APIs.
DeepSeek-V3.1 stands out among locally runnable models because it supports both "thinking" (chain-of-thought) and "non-thinking" modes in a single set of weights. This allows practitioners to deploy one model for both rapid-fire chat applications and deep, multi-step reasoning tasks. While the total parameter count is massive, the model's MoE architecture keeps inference computationally efficient, activating only a fraction of its total weights for any given token.
The core of DeepSeek-V3.1 is its sparse Mixture-of-Experts (MoE) architecture. Out of the 671B total parameters, only 37B are active during the forward pass for any single token. This design choice is critical for local practitioners because it decouples the model's knowledge capacity from its compute requirements. While you need enough VRAM to store the full 671B parameters, the DeepSeek-V3.1 MoE efficiency allows it to generate text at speeds comparable to much smaller dense models (like a 30B-40B parameter model).
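The storage-versus-streaming split above can be sketched with some back-of-the-envelope arithmetic. This is a rough illustration only: it counts weights alone (no KV cache or runtime overhead) and assumes Q4_K_M averages roughly 4.5 bits per weight, which is an approximation, not an official figure.

```python
# Why a 671B MoE can decode like a much smaller dense model: the full
# parameter set must sit in memory, but each token only reads the weights
# of the routed experts (~37B parameters).
TOTAL_PARAMS = 671e9   # all experts; must fit in VRAM/RAM
ACTIVE_PARAMS = 37e9   # parameters used per forward pass

def gib(n_params, bits_per_weight):
    """Weights-only memory in GiB at the given (approximate) quantized width."""
    return n_params * bits_per_weight / 8 / 2**30

storage = gib(TOTAL_PARAMS, 4.5)      # full model at ~Q4_K_M widths
per_token = gib(ACTIVE_PARAMS, 4.5)   # weights streamed per generated token

print(f"stored:          {storage:7.1f} GiB")
print(f"read per token:  {per_token:7.1f} GiB")
print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

The active fraction (about 5.5%) is why decode throughput resembles a 30B-40B dense model even though the memory footprint does not.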
Key architectural features include:

- **Sparse MoE routing:** 671B total parameters, with only 37B activated per token.
- **Multi-head Latent Attention (MLA):** compresses the KV cache, keeping long-context inference memory-efficient.
- **Hybrid reasoning modes:** thinking (chain-of-thought) and non-thinking generation from the same weights, selected via the chat template.
- **128K context window.**
DeepSeek-V3.1 is a text-only model, but its breadth of capability rivals the most expensive proprietary models. It is specifically optimized for environments where precision and logic are paramount.
Coding is one of the most common deployments for this model. DeepSeek-V3.1 excels at code generation, debugging, and large, multi-file refactoring tasks.
The model features enhanced tool-calling capabilities, making it an ideal "brain" for AI agents. It can reliably format JSON, call external APIs, and follow strict schemas. Unlike smaller models that often hallucinate function arguments, V3.1 maintains high reliability in multi-turn tool use, which is essential for autonomous local agents.
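To make the schema-following point concrete, here is a minimal sketch of the OpenAI-compatible "tools" format that most local servers accept, plus the kind of validation guard an agent loop would run on the model's output. The `get_weather` tool and its parameters are hypothetical examples, not part of any DeepSeek API.

```python
import json

# Hypothetical tool definition in the widely used OpenAI-compatible format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # example tool, invented for illustration
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def validate_call(raw_arguments: str, schema: dict) -> dict:
    """Parse a tool call's arguments string and check required keys.

    A reliable model should always pass this; the guard protects an
    autonomous agent against the occasional malformed or incomplete call.
    """
    args = json.loads(raw_arguments)  # raises ValueError on invalid JSON
    missing = [k for k in schema["required"] if k not in args]
    if missing:
        raise ValueError(f"tool call missing required keys: {missing}")
    return args

# A well-formed arguments string, as the model would emit it:
args = validate_call('{"city": "Berlin"}', tools[0]["function"]["parameters"])
print(args)
```

In a multi-turn agent, the validated arguments would be dispatched to the real function and the result appended to the conversation as a tool message.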
DeepSeek-V3.1 is highly proficient in multilingual environments, specifically in English and Chinese, but it also demonstrates strong performance in major European and Asian languages. Its reasoning benchmark scores in mathematics (AIME, MATH) are among the highest for open-weights models, making it a viable tool for scientific research and data analysis.
Running a 671B parameter model locally is a significant hardware challenge. The primary bottleneck is not the compute (FLOPs), but the memory (VRAM/RAM). To run DeepSeek-V3.1 locally, you must account for the massive footprint of the model weights.
The amount of memory required depends entirely on the quantization level. Using the standard BF16 (16-bit) format is impossible for almost all local setups, as it requires over 1.3TB of VRAM. Quantization is mandatory for local practitioners.
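The weights-only footprint can be estimated directly from the parameter count. The bits-per-weight values below are rough effective averages for llama.cpp-style quants (assumptions, not exact figures), and the totals exclude KV cache and runtime overhead.

```python
# Weights-only footprint of the full 671B model at common quantization widths.
PARAMS = 671e9

quants = {          # approximate effective bits per weight (assumed)
    "BF16":   16.0,
    "Q8_0":    8.5,
    "Q4_K_M":  4.8,
    "Q2_K":    2.6,
}

for name, bits in quants.items():
    gb = PARAMS * bits / 8 / 1e9  # decimal gigabytes
    print(f"{name:7s} ~{gb:6.0f} GB")
```

The BF16 row works out to roughly 1.34 TB for the weights alone, which is why quantization is non-negotiable for local deployment.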
If you are looking for the best GPU for DeepSeek-V3.1, you generally need a multi-GPU array or a high-memory Unified Memory system.
DeepSeek-V3.1 tokens-per-second (t/s) throughput will vary with your hardware's memory bandwidth. On a multi-4090 setup using llama.cpp, you can expect between 5 and 15 tokens per second, depending on the quantization level and context usage.
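The bandwidth dependence follows from the fact that decoding is memory-bound: each token must stream the active experts' weights from memory, so bandwidth divided by bytes-per-token gives a hard ceiling. The sketch below uses the RTX 4090's ~1008 GB/s spec bandwidth and an assumed ~4.8 effective bits per weight; real multi-GPU llama.cpp runs land well below this ceiling because of PCIe transfers, expert-routing overhead, and the KV cache.

```python
# Memory-bandwidth ceiling on decode speed for a MoE model:
#   tok/s  <=  bandwidth / bytes_of_active_weights_per_token
ACTIVE_PARAMS = 37e9  # parameters read per generated token

def decode_ceiling(bandwidth_gb_s: float, bits_per_weight: float) -> float:
    """Upper-bound tokens/second if weight streaming were the only cost."""
    bytes_per_token = ACTIVE_PARAMS * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# Single RTX 4090 (~1008 GB/s), ~Q4_K_M widths (both approximate):
print(f"ideal ceiling: {decode_ceiling(1008, 4.8):.0f} tok/s")
```

The gap between this ideal ceiling (tens of tokens per second) and the observed 5-15 t/s is the cost of splitting the model across GPUs and system RAM.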
Ollama is the quickest way to get started. Once you have the necessary hardware, you can run:

```shell
ollama run deepseek-v3.1:latest
```

(Note: ensure you have selected a tag that fits your VRAM, such as the `iq2_xs` or `q4_k_m` variants.)
When evaluating DeepSeek-V3.1 hardware requirements and performance, it is helpful to compare it against its closest competitors in the open-weights space.
For practitioners who can solve the puzzle of running a 671B model on consumer hardware (typically through large multi-GPU nodes or high-RAM Mac systems), DeepSeek-V3.1 provides a level of local intelligence that was previously available only via cloud-based API providers.