
Cost-efficient 284B total / 13B active MoE language model with native 1M-token context. Shares the hybrid attention architecture (CSA + HCA) and Muon-trained backbone of V4-Pro at a fraction of the cost. Reasoning closely approaches V4-Pro (GPQA Diamond 88.1, LiveCodeBench 91.6 in Max mode) while delivering faster response times and dramatically cheaper API pricing.
A solid 284B-parameter MoE language model from DeepSeek. It pulls ahead on graduate-level reasoning (GPQA, 89/100), so reach for it when that is the dimension that matters.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 109.3 GB | Low |
| Q4_K_M (Recommended) | 112.0 GB | Good |
| Q5_K_M | 113.3 GB | Very Good |
| Q6_K | 114.9 GB | Excellent |
| Q8_0 | 118.2 GB | Near Perfect |
| FP16 | 130.5 GB | Full |
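As a rough aid for reading the table, the choice can be sketched in Python: given a VRAM budget, pick the highest-quality format whose listed requirement fits. The figures are copied from the table; the function name `best_format` and the `headroom_gb` parameter are illustrative assumptions, not part of any tool mentioned on this page.

```python
# VRAM figures (GB) copied from the quantization table,
# ordered from lowest to highest quality.
FORMATS = [
    ("Q2_K", 109.3),
    ("Q4_K_M", 112.0),
    ("Q5_K_M", 113.3),
    ("Q6_K", 114.9),
    ("Q8_0", 118.2),
    ("FP16", 130.5),
]


def best_format(vram_gb, headroom_gb=0.0):
    """Return the highest-quality format that fits the budget, or None.

    headroom_gb is a hypothetical reserve for anything the listed
    figures might not cover (for example a long-context KV cache).
    """
    usable = vram_gb - headroom_gb
    fitting = [name for name, need in FORMATS if need <= usable]
    return fitting[-1] if fitting else None


print(best_format(141))  # budget of a 141 GB card
print(best_format(112))  # exactly the Q4_K_M requirement
print(best_format(96))   # below every listed requirement
```

Because the listed requirements span only about 21 GB from Q2_K to FP16, a device either fits nearly every format or none of them, which is why the budget check matters more than the format choice here.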
See which devices can run this model and at what quality level.
| Device | Rating | Speed | VRAM |
|---|---|---|---|
| Google TPU v7 (Ironwood) · Google | SS | 53.0 tok/s | 112.0 GB |
| NVIDIA B200 GPU · NVIDIA | SS | 57.5 tok/s | 112.0 GB |
| | SS | 43.1 tok/s | 112.0 GB |
| | SS | 38.1 tok/s | 112.0 GB |
| | SS | 57.5 tok/s | 112.0 GB |
| NVIDIA H200 SXM 141GB · NVIDIA | SS | 34.5 tok/s | 112.0 GB |
| | SS | 51.0 tok/s | 112.0 GB |
| | SS | 51.0 tok/s | 112.0 GB |
| | SS | 51.0 tok/s | 112.0 GB |
| | SS | 51.0 tok/s | 112.0 GB |
| SuperMicro Super AI Station · SuperMicro | SS | 51.0 tok/s | 112.0 GB |
| Gigabyte W775-V10-L01 · Gigabyte | SS | 51.0 tok/s | 112.0 GB |
| | AA | 26.6 tok/s | 112.0 GB |
| | BB | 5.7 tok/s | 112.0 GB |
| | BB | 5.9 tok/s | 112.0 GB |
| | BB | 5.9 tok/s | 112.0 GB |
| | BB | 5.7 tok/s | 112.0 GB |
| | BB | 4.4 tok/s | 112.0 GB |
| | BB | 4.4 tok/s | 112.0 GB |
| | BB | 4.4 tok/s | 112.0 GB |
| | BB | 3.9 tok/s | 112.0 GB |
| | BB | 3.9 tok/s | 112.0 GB |
| | BB | 3.9 tok/s | 112.0 GB |
| | BB | 3.9 tok/s | 112.0 GB |
| | BB | 2.0 tok/s | 112.0 GB |
Energy cost on Acer Veriton GN100 AI Mini (~2.0 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): DeepSeek-V4-Flash on Acer Veriton GN100 AI Mini · ~2.0 tok/s · 140 W | $2.38 |
| GPT-5.5 (OpenAI) · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic) · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.5 Flash (Google) · in $1.50 · out $9.00 | $3.75 |
| Grok 4.3 (xAI) · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
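The arithmetic behind these figures can be sketched in Python. The 70/30 blend and the 140 W / 2.0 tok/s inputs come from the page. The electricity rate is not stated anywhere, so `ASSUMED_USD_PER_KWH` is an assumption, chosen only because it roughly reproduces the listed $2.38. The function names are illustrative.

```python
def blended_price(input_per_m, output_per_m, input_share=0.70):
    """Blend input and output prices (USD per 1M tokens) by traffic share."""
    return input_share * input_per_m + (1.0 - input_share) * output_per_m


def local_energy_cost(tok_per_s, watts, usd_per_kwh, tokens=1_000_000):
    """Electricity cost (USD) of generating `tokens` at a steady rate."""
    hours = tokens / tok_per_s / 3600.0
    return hours * (watts / 1000.0) * usd_per_kwh


# Assumption: the page does not state an electricity rate.
ASSUMED_USD_PER_KWH = 0.1224

for name, price_in, price_out in [
    ("GPT-5.5", 5.00, 30.00),
    ("Claude Opus 4.7 Thinking", 5.00, 25.00),
    ("Gemini 3.5 Flash", 1.50, 9.00),
    ("Grok 4.3", 1.25, 2.50),
]:
    print(f"{name}: ${blended_price(price_in, price_out):.3f} per 1M tokens")

local = local_energy_cost(2.0, 140, ASSUMED_USD_PER_KWH)
print(f"Local (energy only): ${local:.2f} per 1M tokens")
```

At 2.0 tok/s, one million tokens take roughly 139 hours of continuous generation, so the energy figure describes a cost floor rather than a practical throughput.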
Cheapest current cloud rentals with at least 112 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| AMD Instinct MI300X · RunPod · Community · 192 GB VRAM | $0.50 |
| NVIDIA H200 NVL · RunPod · Community · 141 GB VRAM | $0.50 |
| NVIDIA H200 SXM · Vast.ai · Spot · 141 GB VRAM | $1.32 |
| NVIDIA H200 NVL · Vast.ai · Spot · 141 GB VRAM | $1.33 |
| NVIDIA H200 NVL · Vast.ai · On-Demand · 141 GB VRAM | $1.94 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
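To compare a rental rate against per-token API pricing, the hourly price has to be converted into a cost per million tokens. The sketch below does that conversion. It pairs the $1.32 H200 SXM spot rate with the 34.5 tok/s listed for the NVIDIA H200 SXM 141GB in the device list, and assumes a single generation stream running continuously, which is a simplification. The function name is illustrative.

```python
def rental_cost_per_million(usd_per_gpu_hour, tok_per_s, gpus=1):
    """USD per 1M generated tokens for rented GPUs serving one stream."""
    tokens_per_hour = tok_per_s * 3600.0
    return usd_per_gpu_hour * gpus / tokens_per_hour * 1_000_000


# H200 SXM spot rate from the rental table, paired with the single-stream
# speed listed for the H200 SXM 141GB in the device list.
cost = rental_cost_per_million(1.32, 34.5)
print(f"${cost:.2f} per 1M tokens")
```

Serving several requests in a batch raises aggregate tokens per second, so the per-token cost of a busy rented GPU can be well below this single-stream figure.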
DeepSeek-V4-Flash is a Mixture-of-Experts language model from DeepSeek, released in April 2026 as the cost-efficient sibling to V4-Pro. With 284B total parameters and only 13B active at inference, it occupies a specific niche: near-frontier reasoning capability at a fraction of the compute cost. It shares the same hybrid attention architecture, Muon-trained backbone, and 1M-token context window as V4-Pro, but routes through a smaller active parameter set for faster generation and dramatically lower resource requirements.
This is not a "lite" model in the traditional sense. On GPQA Diamond, Flash scores 88.1 in Max mode; on LiveCodeBench, it hits 91.6. Those numbers put it within spitting distance of V4-Pro (93.5) while using roughly a quarter of the active parameters. For practitioners who need strong reasoning and coding performance but can't justify the hardware or API spend for a 1.6T-parameter model, Flash is the practical choice.
The model is text-only, licensed under MIT, and supports chat, code generation, reasoning, function-calling, multilingual text, math, creative writing, and instruction-following. It's available as open weights on Hugging Face and runs on standard inference frameworks.
DeepSeek-V4-Flash uses a Mixture-of-Experts (MoE) transformer with 284B total parameters, of which 13B are active per token. This is the key efficiency lever: the parameters are spread across expert modules, and only a subset is activated for any given forward pass. The result is per-token compute and generation speed comparable to a dense 13B model, while the effective capacity approaches that of a much larger model. Memory is the exception: all 284B parameters still have to be loaded.
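The routing idea can be illustrated with a toy example in Python. This is a generic top-k gating sketch, not DeepSeek's actual router: the expert count, the value of `k`, the gate scores, and the "experts" themselves (simple scaling functions) are all made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, gate_logits, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(gate_logits, k))


# Toy setup: 8 hypothetical experts, each just scales its input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
gate_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

selected = route_top_k(gate_logits, k=2)
print("experts used:", [i for i, _ in selected])
print("output:", moe_forward(1.0, experts, gate_logits, k=2))

# Share of parameters active per token, from the figures in the text.
print(f"active fraction: {13 / 284:.1%}")
```

Only two of the eight toy experts run for this input, yet all eight have to exist in memory so that a different input can select different ones. That is the same split between compute cost and memory cost described in the text.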
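The architecture incorporates two major innovations from the V4 series: the hybrid attention architecture (CSA + HCA) and the Muon-trained backbone, both shared with V4-Pro.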
The 1M-token context is native, not an afterthought. Flash supports it out of the box without sliding window tricks or context compression hacks. For document analysis, codebase-level reasoning, or multi-turn agentic workflows, this is a practical differentiator against models that cap at 128K or 256K.
DeepSeek-V4-Flash is strongest in coding and reasoning tasks. On SWE-bench Verified, it scores 79.0% — within 1.6 points of V4-Pro's 80.6%. For practical agentic coding workflows, this gap is often imperceptible. The model handles function-calling natively, making it suitable for tool-use agents and automated code generation pipelines.
For reasoning-heavy workloads — math, STEM problem-solving, complex multi-step logic — Flash performs comparably to much larger models. The 88.1 GPQA Diamond score places it ahead of most open-weight models at any size, and the 91.6 LiveCodeBench result confirms it handles competitive programming and algorithmic reasoning well.
Multilingual support is broad, with strong performance across major language families. Creative writing and instruction-following are competent but not standout — this is a reasoning-first model, not a creative writing specialist.
Concrete use cases:
This is a 284B MoE model with 13B active parameters. The practical reality for local deployment depends entirely on quantization and hardware.
| Quantization | Minimum VRAM | Recommended VRAM |
|---|---|---|
| Q4_K_M | ~24 GB | 32 GB |
| Q5_K_M | ~30 GB | 36 GB |
| Q8_0 | ~48 GB | 64 GB |
| FP16 | ~96 GB | 128 GB |
The MoE architecture helps here. Because only 13B parameters are active per token, the model behaves like a 13B dense model in terms of compute. The memory requirement still comes from loading the full 284B parameter set, and quantization reduces it significantly but does not bring it down to 13B-model levels. The minimum figures above sit well below the 112.0 GB this page lists for Q4_K_M in its quantization table, so they are only plausible as GPU-side numbers with most expert weights held in system RAM.
For most users, Q4_K_M is the sweet spot. Going by the table above, it targets a single 24 GB GPU, preserves model quality within 1-2% of FP16 on reasoning benchmarks, and delivers usable inference speeds. If you have 32 GB or more, Q5_K_M is worth the slight quality bump. Q8_0 is overkill for most workloads unless you need maximum precision for agentic tool-calling.
Ollama is the quickest path to local deployment. DeepSeek-V4-Flash is available in the Ollama model library with pre-configured quantization profiles. A typical workflow:
```shell
ollama pull deepseek-v4-flash:q4_k_m
ollama run deepseek-v4-flash:q4_k_m
```
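Once the model has been pulled, it can also be queried programmatically through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes the Ollama server is running on its default port (11434) and that the `deepseek-v4-flash:q4_k_m` tag named above is installed. The helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="deepseek-v4-flash:q4_k_m"):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="deepseek-v4-flash:q4_k_m", timeout=600):
    """Send the prompt and return the generated text.

    Requires a running Ollama server with the model already pulled.
    """
    request = build_request(prompt, model)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

With the server running, `generate("Summarise MoE routing in one sentence.")` would return the reply as a string. The generous default timeout is a guess that allows for slow generation on low-throughput hardware.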
For custom inference, the model loads in any framework that supports MoE and GGUF, including llama.cpp, LM Studio, and text-generation-webui. The Hugging Face checkpoint is also available for Transformers-based inference, though you'll need significant VRAM for unquantized weights.
DeepSeek-V4-Flash vs. Qwen3-235B-A22B (MoE)
Both are MoE models in a similar size class. Qwen3-235B-A22B has 22B active parameters vs. Flash's 13B, which means higher compute requirements per token but potentially better performance on some benchmarks. In practice, Flash matches or exceeds Qwen3 on coding and reasoning benchmarks (GPQA Diamond 88.1 vs. ~85 for Qwen3). Flash's 1M context window is a clear advantage over Qwen3's 128K. Choose Flash if you need long-context capability or faster inference per token; choose Qwen3 if you have the VRAM headroom and want slightly broader knowledge coverage.
DeepSeek-V4-Flash vs. Llama 4 400B (dense)
Llama 4 400B is a dense model with a larger total parameter count (400B vs. 284B) and no MoE routing. This means every forward pass activates all 400B parameters, requiring ~800 GB at FP16 for the weights alone. Flash's MoE architecture makes it dramatically more practical for local deployment: 13B active vs. 400B active. On quality, Flash leads on reasoning benchmarks (GPQA Diamond 88.1 vs. ~78 for Llama 4). Llama 4 has a larger training corpus and broader world knowledge, but for coding and reasoning workloads, Flash is the better choice, especially given the hardware feasibility gap.
When to use Flash: You need strong reasoning and coding capability, you want to run it locally on consumer or prosumer hardware, and long context (up to 1M tokens) is a requirement. The MIT license also makes it suitable for commercial deployment, with few obligations beyond preserving the license and copyright notice.
