
Powerful MoE model with 671B total / 37B active parameters, trained on 14.8T tokens for only $5.6M. Uses Multi-head Latent Attention. Comparable to GPT-4o at release.
Copy and paste this command to start running the model locally:

```
ollama run deepseek-v3
```

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 52.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 59.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 63.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 68.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 77.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 112.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H100 SXM5 80GB | S | 45.1 tok/s | 59.8 GB |
| NVIDIA H200 SXM 141GB | S | 64.6 tok/s | 59.8 GB |
| Google Cloud TPU v5p | S | 37.2 tok/s | 59.8 GB |
| NVIDIA B200 GPU | S | 107.6 tok/s | 59.8 GB |
| NVIDIA A100 SXM4 80GB | S | 27.4 tok/s | 59.8 GB |
DeepSeek-V3 represents a significant shift in the landscape of open-weights large language models (LLMs). Developed by DeepSeek, it is a 671B parameter Mixture-of-Experts (MoE) model designed to compete directly with frontier models like GPT-4o and Claude 3.5 Sonnet. While its total parameter count is massive, the architecture is highly optimized for inference efficiency, utilizing only 37B active parameters per token.
For practitioners looking to run DeepSeek-V3 locally, the challenge is not compute—which is handled efficiently by the MoE architecture—but memory. At 671B parameters, this model pushes the boundaries of what is possible on local hardware, requiring significant VRAM or high-capacity unified memory systems. It is positioned as the premier open-weights choice for developers requiring high-end reasoning, complex coding capabilities, and deep multilingual support without relying on proprietary APIs.
The defining characteristic of DeepSeek-V3 is its Mixture-of-Experts (MoE) architecture coupled with Multi-head Latent Attention (MLA). This combination addresses the two primary bottlenecks in local LLM deployment: computational cost and KV cache size.
In a standard dense model, every parameter is activated for every token. In DeepSeek-V3's MoE setup, the model contains 671B total parameters, but only 37B are "active" during the forward pass. This efficiency allows the model to deliver the reasoning capabilities of a 600B+ parameter model while maintaining the inference latency of a much smaller 37B parameter model. If you have the VRAM to house the weights, tokens-per-second throughput will be surprisingly high compared to dense models like Llama 3.1 405B.
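The compute gap can be made concrete with a back-of-envelope estimate: forward-pass FLOPs per token scale with roughly 2 FLOPs per *active* parameter, so the relevant comparison is active counts, not totals:

```python
# Back-of-envelope per-token compute: ~2 FLOPs per active parameter.
# DeepSeek-V3 activates 37B of its 671B parameters; Llama 3.1 405B is
# dense, so all 405B parameters fire on every token.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
DENSE_405B = 405e9

flops_moe = 2 * ACTIVE_PARAMS    # per token
flops_dense = 2 * DENSE_405B     # per token

print(f"MoE per-token FLOPs:   {flops_moe:.2e}")
print(f"Dense per-token FLOPs: {flops_dense:.2e}")
print(f"Compute ratio (dense / MoE): {flops_dense / flops_moe:.1f}x")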
DeepSeek-V3 utilizes MLA to significantly compress the KV (Key-Value) cache. In traditional Multi-Head Attention (MHA), the KV cache grows linearly with context length and model size, often becoming the primary bottleneck for long-context inference. MLA compresses the KV cache into a latent vector, reducing the memory footprint of the 128,000-token context window. This makes it feasible to run long-context queries on hardware that would otherwise run out of memory.
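The scale of MLA's savings can be sketched with published DeepSeek-V3 config values (61 layers, 128 heads, head dim 128, a ~512-dim compressed KV latent plus a 64-dim decoupled RoPE component); treat these numbers as sizing approximations, not exact cache accounting:

```python
# Rough FP16 KV-cache comparison at the full 128K-token context.
# Standard MHA stores full K and V per head per layer; MLA stores one
# compressed latent vector per token per layer.
LAYERS, HEADS, HEAD_DIM = 61, 128, 128
LATENT_DIM = 512 + 64        # compressed KV latent + decoupled RoPE part
CONTEXT = 128_000
BYTES = 2                    # FP16

mha_cache = CONTEXT * LAYERS * 2 * HEADS * HEAD_DIM * BYTES  # K and V
mla_cache = CONTEXT * LAYERS * LATENT_DIM * BYTES            # one latent

print(f"MHA KV cache: {mha_cache / 1e9:.0f} GB")
print(f"MLA KV cache: {mla_cache / 1e9:.0f} GB")
print(f"Compression:  {mha_cache / mla_cache:.0f}x")
```

Under these assumptions a full-MHA cache at 128K context would dwarf the weights themselves, while the MLA latent cache stays in single-digit gigabytes, which is what makes long-context queries feasible locally.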
The model was trained on 14.8 trillion tokens with a training cutoff in early 2024. Remarkably, DeepSeek achieved this for a total training cost of approximately $5.6M, showcasing extreme algorithmic efficiency. The model supports a wide array of programming languages and demonstrates state-of-the-art performance in mathematics and logic.
DeepSeek-V3 is a general-purpose model with specific tuning for high-complexity tasks. It is not merely a chatbot; it is a functional engine for automated workflows and technical development.
Running a 671B parameter model locally is a significant engineering undertaking. The primary hurdle is VRAM. Because the model must be loaded into memory to achieve usable speeds, consumer-grade hardware requires aggressive quantization.
To determine the best GPU for DeepSeek-V3, you must first decide on your quantization level.
| Quantization | VRAM Required (Approx.) | Recommended Hardware |
| :--- | :--- | :--- |
| FP8 (Native) | ~700 GB | 8x H100 (80GB) or 16x A100 (40GB) |
| Q4_K_M (GGUF) | ~390 GB | 10x-12x RTX 3090/4090 (24GB) or Mac Studio M2/M3 Ultra (192GB + Swapping) |
| Q2_K (GGUF) | ~210 GB | 8x-9x RTX 3090/4090 (24GB) or Mac Studio M2/M3 Ultra (192GB) |
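The VRAM figures above follow from a simple rule of thumb: weight storage is roughly total parameters times bits-per-weight divided by 8, plus overhead for activations and KV cache. A minimal sketch (the bits-per-weight values are typical averages for each format, not exact):

```python
# Estimate weight storage for each format: params * bits-per-weight / 8.
# BPW values are typical averages for GGUF mixed-precision formats.
PARAMS = 671e9
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "FP8": 8.0, "FP16": 16.0}

def weights_gb(fmt: str) -> float:
    """Approximate weight footprint in GB for a given format."""
    return PARAMS * BPW[fmt] / 8 / 1e9

for fmt in ("Q2_K", "Q4_K_M", "FP8"):
    print(f"{fmt}: ~{weights_gb(fmt):.0f} GB of weights")
```

The estimates (~218 GB for Q2_K, ~400 GB for Q4_K_M, ~671 GB for FP8) line up with the table once runtime overhead is added on top.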
For most practitioners, the only way to run DeepSeek-V3 locally on consumer hardware is through a multi-GPU cluster or a high-spec Mac.
When you run `ollama run deepseek-v3`, the software will attempt to manage memory allocation and quantization for you, though it will likely default to a heavily quantized version (IQ2_XXS) if your VRAM is limited.

For a model of this size, Q4_K_M is generally considered the "gold standard" for maintaining intelligence. However, the jump from Q2 to Q4 almost doubles the VRAM requirement, so for local practitioners, Q2_K or IQ2_M is often the only realistic choice. Due to the massive parameter count, even a Q2 quantization of DeepSeek-V3 often outperforms a Q8 quantization of a 70B model.
DeepSeek-V3 occupies a unique niche as a high-parameter MoE model. Here is how it stacks up against realistic alternatives:
Llama 3.1 405B is a dense model, meaning every parameter is used for every token.
Qwen2.5 72B is a much smaller dense model that is easier to run on a single or dual-GPU setup.
DeepSeek-V3 sets the 2025 benchmark for 671B-parameter local models. It is intended for users who have moved past the capabilities of 70B-class models and have the hardware infrastructure to support a massive memory footprint in exchange for frontier-level reasoning and coding performance.