DeepSeek

DeepSeek-V4-Flash

Cost-efficient 284B total / 13B active MoE language model with native 1M-token context. Shares the hybrid attention architecture (CSA + HCA) and Muon-trained backbone of V4-Pro at a fraction of the cost. Reasoning closely approaches V4-Pro (GPQA Diamond 88.1, LiveCodeBench 91.6 in Max mode) while delivering faster response times and dramatically cheaper API pricing.

284B paramsMoE1000K ctx

View on Hugging Face Source Code Official Page

Our Take

Best for: Strongest at graduate-level reasoning (GPQA) in its size class

A solid 284B-parameter MoE language model from DeepSeek. Pulls ahead on graduate-level reasoning (GPQA) (89/100), so reach for it when that's the dimension that matters.

Run this onNVIDIA H200 SXM 141GBCheapest card in our directory with comfortable headroom (141 GB) for this model at Q4 (~112.0 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Reasoning

Function Calling

Multilingual

Math

Creative Writing

Instruction Following

Model Specifications

Parameters284B

Active Params13B

ArchitectureMoE

Context Length1M tokens

ModalityText Only

ProviderDeepSeek

Download Size159.6 GB

Community

Monthly Downloads2.1M

Likes1.6K

Last Updated4 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

MITView Full License

Performance & Scoring

Benchmarks

89.4

86.4

79.0

32.1

56.9

AA Intelligence Index

40.3

44.9

35.6

79.2

63.0

89.3

MBA Open Score

61.7BB

Benchmark40%

63.3

Popularity25%

74.2

Efficiency20%

25.4

Versatility15%

85.0

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	109.3 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	112.0 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	113.3 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	114.9 GB	Excellent	Near-lossless quality with manageable size
Q8_0	118.2 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	130.5 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


Google TPU v7 (Ironwood)Google	SS	53.0 tok/s	112.0 GB
NVIDIA B200 GPUNVIDIA	SS	57.5 tok/s	112.0 GB
AMD Instinct MI325XAMD	SS	43.1 tok/s	112.0 GB
AMD Instinct MI300XAMD	SS	38.1 tok/s	112.0 GB
AMD Instinct MI355XAMD	SS	57.5 tok/s	112.0 GB
NVIDIA H200 SXM 141GBNVIDIA	SS	34.5 tok/s	112.0 GB
ASUS ExpertCenter Pro ET900N G3ASUS	SS	51.0 tok/s	112.0 GB
Dell Pro Max with GB300Dell	SS	51.0 tok/s	112.0 GB
HP ZGX Fury AI StationHP	SS	51.0 tok/s	112.0 GB
MSI XpertStation WS300MSI	SS	51.0 tok/s	112.0 GB
SuperMicro Super AI StationSuperMicro	SS	51.0 tok/s	112.0 GB
Gigabyte W775-V10-L01Gigabyte	SS	51.0 tok/s	112.0 GB
Intel Gaudi 3 AI AcceleratorIntel	AA	26.6 tok/s	112.0 GB
Apple Mac Studio (M2 Ultra, 2023)Apple	BB	5.7 tok/s	112.0 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	BB	5.9 tok/s	112.0 GB
Apple Mac Studio (M3 Ultra, 2025)Apple	BB	5.9 tok/s	112.0 GB
Apple Mac Studio (M1 Ultra, 2022)Apple	BB	5.7 tok/s	112.0 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	BB	4.4 tok/s	112.0 GB
MacBook Pro 16-inch M5 Max (2026)Apple	BB	4.4 tok/s	112.0 GB
MacBook Pro 16" M5 Max (2026)Apple	BB	4.4 tok/s	112.0 GB
Apple M4 Max (40-core GPU)Apple	BB	3.9 tok/s	112.0 GB
Apple Mac Studio (M4 Max, 2025)Apple	BB	3.9 tok/s	112.0 GB
MacBook Pro 14-inch M4 Max (2024)Apple	BB	3.9 tok/s	112.0 GB
MacBook Pro 16" M4 Max (2024)Apple	BB	3.9 tok/s	112.0 GB
Acer Veriton GN100 AI MiniAcer	BB	2.0 tok/s	112.0 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on Acer Veriton GN100 AI Mini (~2.0 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)DeepSeek-V4-Flash on Acer Veriton GN100 AI Mini · ~2.0 tok/s · 140W	$2.38
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 112 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
AMD Instinct MI300XRunPod · Community · 192 GB VRAM	$0.50
NVIDIA H200 NVLRunPod · Community · 141 GB VRAM	$0.50
NVIDIA H200 NVLVast.ai · Spot · 141 GB VRAM	$1.67
NVIDIA H200 SXMVast.ai · Spot · 141 GB VRAM	$1.93
NVIDIA H200 NVLVast.ai · On-Demand · 141 GB VRAM	$1.94

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

DeepSeek-V4-Flash is a Mixture-of-Experts language model from DeepSeek, released in April 2026 as the cost-efficient sibling to V4-Pro. With 284B total parameters and only 13B active at inference, it occupies a specific niche: near-frontier reasoning capability at a fraction of the compute cost. It shares the same hybrid attention architecture, Muon-trained backbone, and 1M-token context window as V4-Pro, but routes through a smaller active parameter set for faster generation and dramatically lower resource requirements.

This is not a "lite" model in the traditional sense. On GPQA Diamond, Flash scores 88.1 in Max mode; on LiveCodeBench, it hits 91.6. Those numbers put it within spitting distance of V4-Pro (93.5) while using roughly a quarter of the active parameters. For practitioners who need strong reasoning and coding performance but can't justify the hardware or API spend for a 1.6T-parameter model, Flash is the practical choice.

The model is text-only, licensed under MIT, and supports chat, code generation, reasoning, function-calling, multilingual text, math, creative writing, and instruction-following. It's available as open weights on Hugging Face and runs on standard inference frameworks.

Architecture & Technical Details

DeepSeek-V4-Flash uses a Mixture-of-Experts (MoE) transformer with 284B total parameters, of which 13B are active per token. This is the key efficiency lever: the model has 284B parameters spread across expert modules, but only a subset are activated for any given forward pass. The result is inference speed and memory usage comparable to a dense 13B model, while the effective capacity approaches that of a much larger model.

The architecture incorporates two major innovations from the V4 series:

Hybrid Attention (CSA + HCA): Compressed Sparse Attention handles long-range dependencies efficiently, while Heavily Compressed Attention further reduces KV cache overhead. At 1M-token context, this architecture requires roughly 27% of the inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2. For local deployment, this means the model can actually utilize its full 1M context window without memory exploding.

Muon Optimizer: The model was trained using the Muon optimizer rather than standard AdamW. This is relevant for local deployment only insofar as it affects the model's sensitivity to quantization and inference precision — Muon-trained models tend to be more robust to lower-bit representations.

Manifold-Constrained Hyper-Connections (mHC): Improves signal propagation across layers, which contributes to the model's ability to maintain coherence over very long contexts.

The 1M-token context is native, not an afterthought. Flash supports it out of the box without sliding window tricks or context compression hacks. For document analysis, codebase-level reasoning, or multi-turn agentic workflows, this is a practical differentiator against models that cap at 128K or 256K.

Capabilities & Use Cases

DeepSeek-V4-Flash is strongest in coding and reasoning tasks. On SWE-bench Verified, it scores 79.0% — within 1.6 points of V4-Pro's 80.6%. For practical agentic coding workflows, this gap is often imperceptible. The model handles function-calling natively, making it suitable for tool-use agents and automated code generation pipelines.

For reasoning-heavy workloads — math, STEM problem-solving, complex multi-step logic — Flash performs comparably to much larger models. The 88.1 GPQA Diamond score places it ahead of most open-weight models at any size, and the 91.6 LiveCodeBench result confirms it handles competitive programming and algorithmic reasoning well.

Multilingual support is broad, with strong performance across major language families. Creative writing and instruction-following are competent but not standout — this is a reasoning-first model, not a creative writing specialist.

Concrete use cases:

Agentic coding assistants: Flash powers DeepSeek's own in-house agentic coding tools. For local deployment, it pairs well with frameworks like Claude Code, OpenClaw, or OpenCode.
Long-context document analysis: The 1M context enables processing of entire codebases, lengthy technical reports, or multi-document retrieval without chunking.
Complex reasoning pipelines: Math proof verification, multi-step planning, and tool-using agents that require sustained logical coherence.
Cost-sensitive API replacement: If you're currently paying for a tier-1 API for coding tasks, Flash can match most of that capability locally or at a fraction of the cost.

Running DeepSeek-V4-Flash Locally

This is a 284B MoE model with 13B active parameters. The practical reality for local deployment depends entirely on quantization and hardware.

VRAM Requirements

Quantization	Minimum VRAM	Recommended VRAM
Q4_K_M	~24 GB	32 GB
Q5_K_M	~30 GB	36 GB
Q8_0	~48 GB	64 GB
FP16	~96 GB	128 GB

The MoE architecture is your friend here. Because only 13B parameters are active per token, the model behaves like a 13B dense model in terms of compute. The VRAM requirement comes from loading the full 284B parameter set, but quantization reduces this significantly.

What Hardware Can Run It

Consumer GPUs: A single RTX 4090 (24 GB) can run Q4_K_M with some headroom. Expect 10-15 tokens/second at Q4_K_M on a 4090. Two RTX 3090s or 4090s in parallel can handle Q5_K_M or Q8_0 comfortably.
Workstation GPUs: RTX 6000 Ada (48 GB) or A6000 (48 GB) can run Q8_0 on a single card. Two A6000s handle FP16.
Apple Silicon: An M4 Max with 64 GB unified memory can run Q4_K_M or Q5_K_M. Expect 5-8 tokens/second depending on prompt complexity. M4 Ultra (128 GB) can run Q8_0 or FP16.
Multi-GPU setups: The model splits well across multiple GPUs. Two RTX 4090s at Q8_0 will outperform a single A100 for inference throughput.

Recommended Quantization

For most users, Q4_K_M is the sweet spot. It fits on a single 24 GB GPU, preserves model quality within 1-2% of FP16 on reasoning benchmarks, and delivers usable inference speeds. If you have 32 GB or more, Q5_K_M is worth the slight quality bump. Q8_0 is overkill for most workloads unless you need maximum precision for agentic tool-calling.

Getting Started

Ollama is the quickest path to local deployment. DeepSeek-V4-Flash is available in the Ollama model library with pre-configured quantization profiles. A typical workflow:

1ollama pull deepseek-v4-flash:q4_k_m
2ollama run deepseek-v4-flash:q4_k_m

For custom inference, the model loads in any framework that supports MoE and GGUF, including llama.cpp, LM Studio, and text-generation-webui. The Hugging Face checkpoint is also available for Transformers-based inference, though you'll need significant VRAM for unquantized weights.

How It Compares

DeepSeek-V4-Flash vs. Qwen3-235B-A22B (MoE)

Both are MoE models in a similar size class. Qwen3-235B-A22B has 22B active parameters vs. Flash's 13B, which means higher compute requirements per token but potentially better performance on some benchmarks. In practice, Flash matches or exceeds Qwen3 on coding and reasoning benchmarks (GPQA Diamond 88.1 vs. ~85 for Qwen3). Flash's 1M context window is a clear advantage over Qwen3's 128K. Choose Flash if you need long-context capability or faster inference per token; choose Qwen3 if you have the VRAM headroom and want slightly broader knowledge coverage.

DeepSeek-V4-Flash vs. Llama 4 400B (dense)

Llama 4 400B is a dense model at roughly the same total parameter count, but with no MoE routing. This means every forward pass activates all 400B parameters, requiring ~800 GB at FP16. Flash's MoE architecture makes it dramatically more practical for local deployment — 13B active vs. 400B active. On quality, Flash leads on reasoning benchmarks (GPQA Diamond 88.1 vs. ~78 for Llama 4). Llama 4 has a larger training corpus and broader world knowledge, but for coding and reasoning workloads, Flash is the better choice, especially given the hardware feasibility gap.

When to use Flash: You need strong reasoning and coding capability, you want to run it locally on consumer or prosumer hardware, and long context (up to 1M tokens) is a requirement. The MIT license also makes it suitable for commercial deployment without restrictions.

Related Models

DeepSeek

Explore the Provider

See all DeepSeek models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every DeepSeek model we track.

Open DeepSeek

Explore the Family

See every DeepSeek release

The full DeepSeek family leaderboard with sizes, benchmark scores, and a release timeline.

Open DeepSeek

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

DeepSeek

DeepSeek-V4-Flash

284B paramsMoE1000K ctx

View on Hugging Face Source Code Official Page

Our Take

Best for: Strongest at graduate-level reasoning (GPQA) in its size class

A solid 284B-parameter MoE language model from DeepSeek. Pulls ahead on graduate-level reasoning (GPQA) (89/100), so reach for it when that's the dimension that matters.

Run this onNVIDIA H200 SXM 141GBCheapest card in our directory with comfortable headroom (141 GB) for this model at Q4 (~112.0 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Reasoning

Function Calling

Multilingual

Math

Creative Writing

Instruction Following

Model Specifications

Parameters284B

Active Params13B

ArchitectureMoE

Context Length1M tokens

ModalityText Only

ProviderDeepSeek

Download Size159.6 GB

Community

Monthly Downloads2.1M

Likes1.6K

Last Updated4 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

MITView Full License

Performance & Scoring

Benchmarks

89.4

86.4

79.0

32.1

56.9

AA Intelligence Index

40.3

44.9

35.6

79.2

63.0

89.3

MBA Open Score

61.7BB

Benchmark40%

63.3

Popularity25%

74.2

Efficiency20%

25.4

Versatility15%

85.0

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	109.3 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	112.0 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	113.3 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	114.9 GB	Excellent	Near-lossless quality with manageable size
Q8_0	118.2 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	130.5 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


Google TPU v7 (Ironwood)Google	SS	53.0 tok/s	112.0 GB
NVIDIA B200 GPUNVIDIA	SS	57.5 tok/s	112.0 GB
AMD Instinct MI325XAMD	SS	43.1 tok/s	112.0 GB
AMD Instinct MI300XAMD	SS	38.1 tok/s	112.0 GB
AMD Instinct MI355XAMD	SS	57.5 tok/s	112.0 GB
NVIDIA H200 SXM 141GBNVIDIA	SS	34.5 tok/s	112.0 GB
ASUS ExpertCenter Pro ET900N G3ASUS	SS	51.0 tok/s	112.0 GB
Dell Pro Max with GB300Dell	SS	51.0 tok/s	112.0 GB
HP ZGX Fury AI StationHP	SS	51.0 tok/s	112.0 GB
MSI XpertStation WS300MSI	SS	51.0 tok/s	112.0 GB
SuperMicro Super AI StationSuperMicro	SS	51.0 tok/s	112.0 GB
Gigabyte W775-V10-L01Gigabyte	SS	51.0 tok/s	112.0 GB
Intel Gaudi 3 AI AcceleratorIntel	AA	26.6 tok/s	112.0 GB
Apple Mac Studio (M2 Ultra, 2023)Apple	BB	5.7 tok/s	112.0 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	BB	5.9 tok/s	112.0 GB
Apple Mac Studio (M3 Ultra, 2025)Apple	BB	5.9 tok/s	112.0 GB
Apple Mac Studio (M1 Ultra, 2022)Apple	BB	5.7 tok/s	112.0 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	BB	4.4 tok/s	112.0 GB
MacBook Pro 16-inch M5 Max (2026)Apple	BB	4.4 tok/s	112.0 GB
MacBook Pro 16" M5 Max (2026)Apple	BB	4.4 tok/s	112.0 GB
Apple M4 Max (40-core GPU)Apple	BB	3.9 tok/s	112.0 GB
Apple Mac Studio (M4 Max, 2025)Apple	BB	3.9 tok/s	112.0 GB
MacBook Pro 14-inch M4 Max (2024)Apple	BB	3.9 tok/s	112.0 GB
MacBook Pro 16" M4 Max (2024)Apple	BB	3.9 tok/s	112.0 GB
Acer Veriton GN100 AI MiniAcer	BB	2.0 tok/s	112.0 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on Acer Veriton GN100 AI Mini (~2.0 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)DeepSeek-V4-Flash on Acer Veriton GN100 AI Mini · ~2.0 tok/s · 140W	$2.38
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 112 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
AMD Instinct MI300XRunPod · Community · 192 GB VRAM	$0.50
NVIDIA H200 NVLRunPod · Community · 141 GB VRAM	$0.50
NVIDIA H200 NVLVast.ai · Spot · 141 GB VRAM	$1.67
NVIDIA H200 SXMVast.ai · Spot · 141 GB VRAM	$1.93
NVIDIA H200 NVLVast.ai · On-Demand · 141 GB VRAM	$1.94

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Overview

Architecture & Technical Details

The architecture incorporates two major innovations from the V4 series:

Hybrid Attention (CSA + HCA): Compressed Sparse Attention handles long-range dependencies efficiently, while Heavily Compressed Attention further reduces KV cache overhead. At 1M-token context, this architecture requires roughly 27% of the inference FLOPs and 10% of the KV cache compared to DeepSeek-V3.2. For local deployment, this means the model can actually utilize its full 1M context window without memory exploding.

Muon Optimizer: The model was trained using the Muon optimizer rather than standard AdamW. This is relevant for local deployment only insofar as it affects the model's sensitivity to quantization and inference precision — Muon-trained models tend to be more robust to lower-bit representations.

Manifold-Constrained Hyper-Connections (mHC): Improves signal propagation across layers, which contributes to the model's ability to maintain coherence over very long contexts.

Capabilities & Use Cases

Concrete use cases:

Agentic coding assistants: Flash powers DeepSeek's own in-house agentic coding tools. For local deployment, it pairs well with frameworks like Claude Code, OpenClaw, or OpenCode.
Long-context document analysis: The 1M context enables processing of entire codebases, lengthy technical reports, or multi-document retrieval without chunking.
Complex reasoning pipelines: Math proof verification, multi-step planning, and tool-using agents that require sustained logical coherence.
Cost-sensitive API replacement: If you're currently paying for a tier-1 API for coding tasks, Flash can match most of that capability locally or at a fraction of the cost.

Running DeepSeek-V4-Flash Locally

This is a 284B MoE model with 13B active parameters. The practical reality for local deployment depends entirely on quantization and hardware.

VRAM Requirements

Quantization	Minimum VRAM	Recommended VRAM
Q4_K_M	~24 GB	32 GB
Q5_K_M	~30 GB	36 GB
Q8_0	~48 GB	64 GB
FP16	~96 GB	128 GB

What Hardware Can Run It

Consumer GPUs: A single RTX 4090 (24 GB) can run Q4_K_M with some headroom. Expect 10-15 tokens/second at Q4_K_M on a 4090. Two RTX 3090s or 4090s in parallel can handle Q5_K_M or Q8_0 comfortably.
Workstation GPUs: RTX 6000 Ada (48 GB) or A6000 (48 GB) can run Q8_0 on a single card. Two A6000s handle FP16.
Apple Silicon: An M4 Max with 64 GB unified memory can run Q4_K_M or Q5_K_M. Expect 5-8 tokens/second depending on prompt complexity. M4 Ultra (128 GB) can run Q8_0 or FP16.
Multi-GPU setups: The model splits well across multiple GPUs. Two RTX 4090s at Q8_0 will outperform a single A100 for inference throughput.

Recommended Quantization

Getting Started

Ollama is the quickest path to local deployment. DeepSeek-V4-Flash is available in the Ollama model library with pre-configured quantization profiles. A typical workflow:

1ollama pull deepseek-v4-flash:q4_k_m
2ollama run deepseek-v4-flash:q4_k_m

How It Compares

DeepSeek-V4-Flash vs. Qwen3-235B-A22B (MoE)

DeepSeek-V4-Flash vs. Llama 4 400B (dense)

Related Models

DeepSeek

DeepSeek-V4-Pro

1600BMoE

DeepSeek

DeepSeek-V3.2

685BMoE

DeepSeek

DeepSeek-V3

671BMoE

DeepSeek

DeepSeek-R1

671BMoE

Explore the Provider

See all DeepSeek models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every DeepSeek model we track.

Open DeepSeek

Explore the Family

See every DeepSeek release

The full DeepSeek family leaderboard with sizes, benchmark scores, and a release timeline.

Open DeepSeek

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.