Cost-efficient 284B total / 13B active MoE language model with native 1M-token context. Shares the hybrid attention architecture (CSA + HCA) and Muon-trained backbone of V4-Pro at a fraction of the cost. Reasoning closely approaches V4-Pro (GPQA Diamond 88.1, LiveCodeBench 91.6 in Max mode) while delivering faster response times and dramatically cheaper API pricing.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 109.3 GB | Low |
| Q4_K_M (Recommended) | 112.0 GB | Good |
| Q5_K_M | 113.3 GB | Very Good |
| Q6_K | 114.9 GB | Excellent |
| Q8_0 | 118.2 GB | Near Perfect |
| FP16 | 130.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Quality Level | Speed | VRAM Required |
|---|---|---|---|
| Google TPU v7 (Ironwood) | SS | 53.0 tok/s | 112.0 GB |
| NVIDIA B200 GPU | SS | 57.5 tok/s | 112.0 GB |
|  | SS | 43.1 tok/s | 112.0 GB |
|  | SS | 38.1 tok/s | 112.0 GB |
|  | SS | 57.5 tok/s | 112.0 GB |
| NVIDIA H200 SXM 141GB | SS | 34.5 tok/s | 112.0 GB |
|  | SS | 51.0 tok/s | 112.0 GB |
|  | SS | 51.0 tok/s | 112.0 GB |
|  | SS | 51.0 tok/s | 112.0 GB |
|  | SS | 51.0 tok/s | 112.0 GB |
| SuperMicro Super AI Station | SS | 51.0 tok/s | 112.0 GB |
| Gigabyte W775-V10-L01 | SS | 51.0 tok/s | 112.0 GB |
|  | AA | 26.6 tok/s | 112.0 GB |
|  | BB | 5.7 tok/s | 112.0 GB |
|  | BB | 5.9 tok/s | 112.0 GB |
|  | BB | 5.9 tok/s | 112.0 GB |
|  | BB | 5.7 tok/s | 112.0 GB |
|  | BB | 4.4 tok/s | 112.0 GB |
|  | BB | 4.4 tok/s | 112.0 GB |
|  | BB | 4.4 tok/s | 112.0 GB |
|  | BB | 3.9 tok/s | 112.0 GB |
|  | BB | 3.9 tok/s | 112.0 GB |
|  | BB | 3.9 tok/s | 112.0 GB |
|  | BB | 3.9 tok/s | 112.0 GB |
|  | BB | 2.0 tok/s | 112.0 GB |
DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts language model from DeepSeek, released in April 2026 as the cost-efficient counterpart to the flagship V4-Pro. With only 13B active parameters per token, it delivers reasoning performance that closely tracks V4-Pro—88.1 on GPQA Diamond and 91.6 on LiveCodeBench in Max mode—while requiring a fraction of the compute and memory. It is released under the MIT license with open weights available on HuggingFace.
This model occupies a specific and valuable niche: it offers frontier-level reasoning and coding capability at an inference cost that makes local deployment viable on high-end consumer hardware. The 284B total / 13B active MoE architecture means you get the representational capacity of a massive model without paying the full activation cost on every forward pass. For practitioners who need strong reasoning, coding, and instruction-following in a package that can run on a single GPU with quantization, V4-Flash is currently the most compelling option at this scale.
V4-Flash uses the same hybrid attention architecture as V4-Pro, combining Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA). This is not a minor efficiency tweak—it fundamentally changes the cost profile for long-context inference. At 1M tokens of context, V4-Flash requires approximately 27% of the single-token inference FLOPs and 10% of the KV cache compared to what a dense model of equivalent capability would demand. The 1,000,000-token native context window is not theoretical; it works out of the box without chunking or sliding window tricks.
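To put the KV-cache claim in perspective, here is a rough sizing sketch. The layer count, KV-head count, and head dimension below are placeholder assumptions (the real values live in the model's config files), and the formula is the standard per-layer key/value cache used by GQA-style models, not a description of how CSA/HCA actually store their compressed state; the 10% factor is simply the figure quoted above applied to that baseline.

```python
# Back-of-the-envelope KV-cache sizing at 1M tokens of context.
# n_layers, n_kv_heads, and head_dim are placeholder assumptions,
# not the published V4-Flash configuration.
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; one entry per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

dense_gb = kv_cache_gb(seq_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128)
flash_gb = 0.10 * dense_gb  # the ~10% figure quoted for the hybrid CSA + HCA stack

print(f"uncompressed-style KV cache: {dense_gb:.0f} GB")
print(f"compressed KV cache:         {flash_gb:.0f} GB")
```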
The MoE layout uses 284B total parameters with 13B active per token. For local deployment, this is the critical number: your GPU only needs to load the active expert weights plus shared attention parameters into VRAM for inference. The remaining 271B parameters sit idle on disk or in system memory, swapped in only as the router selects different experts. This is what makes running a 284B model feasible on consumer hardware—the effective memory footprint is closer to a 13B-20B dense model than a 284B one.
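A quick sketch of what the active parameters alone weigh at common quantization levels is below. The bits-per-weight values are approximate effective figures for llama.cpp-style K-quants, and this counts only the weights touched on a single forward pass; resident memory in practice also includes the router, embeddings, expert pages staged for swapping, and the KV cache, which is why the full-model figures in the table above are much larger.

```python
# Rough footprint of the per-token active weights under different quant levels.
# Bits-per-weight values are approximate; this is NOT the full VRAM requirement.
ACTIVE_PARAMS = 13e9  # 13B active parameters per token

BITS_PER_WEIGHT = {
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q5_K_M":  5.7,   # approximate effective bits for K-quants
    "Q4_K_M":  4.8,
}

for fmt, bits in BITS_PER_WEIGHT.items():
    gb = ACTIVE_PARAMS * bits / 8 / 1e9
    print(f"{fmt:>7}: ~{gb:.1f} GB of active weights per forward pass")
```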
The model also incorporates Manifold-Constrained Hyper-Connections (mHC) for improved signal propagation across layers, and was trained using the Muon optimizer. These architectural decisions contribute to the model's stability during long generations and its ability to maintain coherence across the full 1M-token context window.
V4-Flash is a text-only model that excels across the full spectrum of language tasks, but its standout strengths are reasoning, coding, and structured instruction-following.
Reasoning and math: GPQA Diamond 88.1 and LiveCodeBench 91.6 place it within striking distance of closed-source frontier models. For complex multi-step reasoning, chain-of-thought prompting, and mathematical problem-solving, this model performs at a level that was exclusive to API-only models six months ago.
Coding: The model handles full-stack development, code generation, debugging, and refactoring across major languages. Its function-calling support makes it viable for agentic workflows where the model needs to invoke tools, query databases, or orchestrate API calls. The 1M-token context is particularly useful for codebase analysis—you can feed an entire repository into context and ask for architecture reviews or migration plans.
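A minimal function-calling sketch is shown below, assuming the model is served behind an OpenAI-compatible endpoint (for example via vLLM or llama.cpp's server). The base URL, model name, and tool schema are illustrative placeholders, not values published for V4-Flash; substitute whatever your serving stack exposes.

```python
# Function-calling sketch against an assumed OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

# Hypothetical tool the model may choose to invoke during an agentic run.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the repository's test suite and return the failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Test directory"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # placeholder name; use your deployment's model id
    messages=[{"role": "user", "content": "The auth tests are failing; find the cause."}],
    tools=tools,
)

# If the model decides to call the tool, the call shows up here:
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```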
Multilingual and creative writing: The model performs strongly across Chinese, English, and other major languages. Creative writing quality is high for an