
Powerful MoE model with 671B total / 37B active parameters, trained on 14.8T tokens for only $5.6M. Uses Multi-head Latent Attention. Comparable to GPT-4o at release.
Copy and paste this command to start running the model locally:

```
ollama run deepseek-v3
```

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 52.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 59.8 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 63.5 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 68.0 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 77.2 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 112.4 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H100 SXM5 80GB | S | 45.1 tok/s | 59.8 GB |
| NVIDIA H200 SXM 141GB | S | 64.6 tok/s | 59.8 GB |
| Google Cloud TPU v5p | S | 37.2 tok/s | 59.8 GB |
| NVIDIA B200 GPU | S | 107.6 tok/s | 59.8 GB |
| NVIDIA A100 SXM4 80GB | S | 27.4 tok/s | 59.8 GB |
DeepSeek-V3 represents a significant shift in the landscape of open-weights large language models (LLMs). Developed by DeepSeek, it is a 671B parameter Mixture-of-Experts (MoE) model designed to compete directly with frontier models like GPT-4o and Claude 3.5 Sonnet. While its total parameter count is massive, the architecture is highly optimized for inference efficiency, utilizing only 37B active parameters per token.
For practitioners looking to run DeepSeek-V3 locally, the challenge is not compute—which is handled efficiently by the MoE architecture—but memory. At 671B parameters, this model pushes the boundaries of what is possible on local hardware, requiring significant VRAM or high-capacity unified memory systems. It is positioned as the premier open-weights choice for developers requiring high-end reasoning, complex coding capabilities, and deep multilingual support without relying on proprietary APIs.
The defining characteristic of DeepSeek-V3 is its Mixture-of-Experts (MoE) architecture coupled with Multi-head Latent Attention (MLA). This combination addresses the two primary bottlenecks in local LLM deployment: computational cost and KV cache size.
In a standard dense model, every parameter is activated for every token. In DeepSeek-V3's MoE setup, the model contains 671B total parameters, but only 37B are "active" during the forward pass. This efficiency allows the model to deliver the reasoning capabilities of a 600B+ parameter model while maintaining the inference latency of a much smaller 37B parameter model. If you have the VRAM to house the weights, tokens-per-second throughput will be surprisingly high compared to dense models like Llama 3.1 405B.
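The compute gap can be made concrete with a back-of-envelope estimate: forward-pass FLOPs per token scale with roughly 2 FLOPs per *active* parameter, so the relevant comparison is active counts, not totals:

```python
# Back-of-envelope per-token compute: ~2 FLOPs per active parameter.
# DeepSeek-V3 activates 37B of its 671B parameters; Llama 3.1 405B is
# dense, so all 405B parameters fire on every token.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
DENSE_405B = 405e9

flops_moe = 2 * ACTIVE_PARAMS    # per token
flops_dense = 2 * DENSE_405B     # per token

print(f"MoE per-token FLOPs:   {flops_moe:.2e}")
print(f"Dense per-token FLOPs: {flops_dense:.2e}")
print(f"Compute ratio (dense / MoE): {flops_dense / flops_moe:.1f}x")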
DeepSeek-V3 utilizes MLA to significantly compress the KV (Key-Value) cache. In traditional Multi-Head Attention (MHA), the KV cache grows linearly with context length and model size, often becoming the primary bottleneck for long-context inference. MLA compresses the KV cache into a latent vector, reducing the memory footprint of the 128,000-token context window. This makes it feasible to run long-context queries on hardware that would otherwise run out of memory.
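The scale of MLA's savings can be sketched with published DeepSeek-V3 config values (61 layers, 128 heads, head dim 128, a ~512-dim compressed KV latent plus a 64-dim decoupled RoPE component); treat these numbers as sizing approximations, not exact cache accounting:

```python
# Rough FP16 KV-cache comparison at the full 128K-token context.
# Standard MHA stores full K and V per head per layer; MLA stores one
# compressed latent vector per token per layer.
LAYERS, HEADS, HEAD_DIM = 61, 128, 128
LATENT_DIM = 512 + 64        # compressed KV latent + decoupled RoPE part
CONTEXT = 128_000
BYTES = 2                    # FP16

mha_cache = CONTEXT * LAYERS * 2 * HEADS * HEAD_DIM * BYTES  # K and V
mla_cache = CONTEXT * LAYERS * LATENT_DIM * BYTES            # one latent

print(f"MHA KV cache: {mha_cache / 1e9:.0f} GB")
print(f"MLA KV cache: {mla_cache / 1e9:.0f} GB")
print(f"Compression:  {mha_cache / mla_cache:.0f}x")
```

Under these assumptions a full-MHA cache at 128K context would dwarf the weights themselves, while the MLA latent cache stays in single-digit gigabytes, which is what makes long-context queries feasible locally.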
The model was trained on 14.8 trillion tokens with a training cutoff in early 2024. Remarkably, DeepSeek achieved this for a total training cost of approximately $5.6M, showcasing extreme algorithmic efficiency. The model supports a wide array of programming languages and demonstrates state-of-the-art performance in mathematics and logic.
DeepSeek-V3 is a general-purpose model with specific tuning for high-complexity tasks. It is not merely a chatbot; it is a functional engine for automated workflows and technical development.
Running a 671B parameter model locally is a significant engineering undertaking. The primary hurdle is VRAM. Because the model must be loaded into memory to achieve usable speeds, consumer-grade hardware requires aggressive quantization.
To determine the best GPU for DeepSeek-V3, you must first decide on your quantization level.
| Quantization | VRAM Required (Approx.) | Recommended Hardware |
| :--- | :--- | :--- |
| FP8 (Native) | ~700 GB | 8x H100 (80GB) or 16x A100 (40GB) |
| Q4_K_M (GGUF) | ~390 GB | 10x-12x RTX 3090/4090 (24GB) or Mac Studio M2/M3 Ultra (192GB + Swapping) |
| Q2_K (GGUF) | ~210 GB | 8x-9x RTX 3090/4090 (24GB) or Mac Studio M2/M3 Ultra (192GB) |
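The VRAM figures above follow from a simple rule of thumb: weight storage is roughly total parameters times bits-per-weight divided by 8, plus overhead for activations and KV cache. A minimal sketch (the bits-per-weight values are typical averages for each format, not exact):

```python
# Estimate weight storage for each format: params * bits-per-weight / 8.
# BPW values are typical averages for GGUF mixed-precision formats.
PARAMS = 671e9
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.8, "Q8_0": 8.5, "FP8": 8.0, "FP16": 16.0}

def weights_gb(fmt: str) -> float:
    """Approximate weight footprint in GB for a given format."""
    return PARAMS * BPW[fmt] / 8 / 1e9

for fmt in ("Q2_K", "Q4_K_M", "FP8"):
    print(f"{fmt}: ~{weights_gb(fmt):.0f} GB of weights")
```

The estimates (~218 GB for Q2_K, ~400 GB for Q4_K_M, ~671 GB for FP8) line up with the table once runtime overhead is added on top.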
For most practitioners, the only way to run DeepSeek-V3 locally on consumer hardware is through a multi-GPU cluster or a high-spec Mac.
When you run `ollama run deepseek-v3`, the software will attempt to manage memory allocation and quantization for you, though it will likely default to a heavily quantized version (IQ2_XXS) if your VRAM is limited.

For a model of this size, Q4_K_M is generally considered the "gold standard" for maintaining intelligence. However, the jump from Q2 to Q4 almost doubles the VRAM requirement, so for local practitioners, Q2_K or IQ2_M is often the only realistic choice. Due to the massive parameter count, even a Q2 quantization of DeepSeek-V3 often outperforms a Q8 quantization of a 70B model.
DeepSeek-V3 occupies a unique niche as a high-parameter MoE model. Here is how it stacks up against realistic alternatives:
Llama 3.1 405B is a dense model, meaning every parameter is used for every token.
Qwen2.5 72B is a much smaller dense model that is easier to run on a single or dual-GPU setup.
DeepSeek-V3 sets the 2025 benchmark for 671B-parameter local models. It is intended for users who have moved past the capabilities of 70B-class models and have the hardware infrastructure to support a massive memory footprint in exchange for frontier-level reasoning and coding performance.