Cost-efficient 284B total / 13B active MoE language model with native 1M-token context. Shares the hybrid attention architecture (CSA + HCA) and Muon-trained backbone of V4-Pro at a fraction of the cost. Reasoning closely approaches V4-Pro (GPQA Diamond 88.1, LiveCodeBench 91.6 in Max mode) while delivering faster response times and dramatically cheaper API pricing.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 109.3 GB | Low |
| Q4_K_M (Recommended) | 112.0 GB | Good |
| Q5_K_M | 113.3 GB | Very Good |
| Q6_K | 114.9 GB | Excellent |
| Q8_0 | 118.2 GB | Near Perfect |
| FP16 | 130.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Quality Level | Speed | VRAM Required |
|---|---|---|---|
| Google TPU v7 (Ironwood) | SS | 53.0 tok/s | 112.0 GB |
| NVIDIA B200 GPU | SS | 57.5 tok/s | 112.0 GB |
|  | SS | 43.1 tok/s | 112.0 GB |
|  | SS | 38.1 tok/s | 112.0 GB |
|  | SS | 57.5 tok/s | 112.0 GB |
| NVIDIA H200 SXM 141GB | SS | 34.5 tok/s | 112.0 GB |
|  | SS | 51.0 tok/s | 112.0 GB |
|  | SS | 51.0 tok/s | 112.0 GB |
|  | SS | 51.0 tok/s | 112.0 GB |
|  | SS | 51.0 tok/s | 112.0 GB |
| SuperMicro Super AI Station | SS | 51.0 tok/s | 112.0 GB |
| Gigabyte W775-V10-L01 | SS | 51.0 tok/s | 112.0 GB |
|  | AA | 26.6 tok/s | 112.0 GB |
|  | BB | 5.7 tok/s | 112.0 GB |
|  | BB | 5.9 tok/s | 112.0 GB |
|  | BB | 5.9 tok/s | 112.0 GB |
|  | BB | 5.7 tok/s | 112.0 GB |
|  | BB | 4.4 tok/s | 112.0 GB |
|  | BB | 4.4 tok/s | 112.0 GB |
|  | BB | 4.4 tok/s | 112.0 GB |
|  | BB | 3.9 tok/s | 112.0 GB |
|  | BB | 3.9 tok/s | 112.0 GB |
|  | BB | 3.9 tok/s | 112.0 GB |
|  | BB | 3.9 tok/s | 112.0 GB |
|  | BB | 2.0 tok/s | 112.0 GB |
DeepSeek-V4-Flash is a 284B-parameter Mixture-of-Experts language model from DeepSeek, released in April 2026 as the cost-efficient counterpart to the flagship V4-Pro. With only 13B active parameters per token, it delivers reasoning performance that closely tracks V4-Pro—88.1 on GPQA Diamond and 91.6 on LiveCodeBench in Max mode—while requiring a fraction of the compute and memory. It is released under the MIT license with open weights available on HuggingFace.
This model occupies a specific and valuable niche: it offers frontier-level reasoning and coding capability at an inference cost that makes local deployment viable on high-end consumer hardware. The 284B total / 13B active MoE architecture means you get the representational capacity of a massive model without paying the full activation cost on every forward pass. For practitioners who need strong reasoning, coding, and instruction-following in a package that can run on a single GPU with quantization, V4-Flash is currently the most compelling option at this scale.
V4-Flash uses the same hybrid attention architecture as V4-Pro, combining Compressed Sparse Attention (CSA) with Heavily Compressed Attention (HCA). This is not a minor efficiency tweak—it fundamentally changes the cost profile for long-context inference. At 1M tokens of context, V4-Flash requires approximately 27% of the single-token inference FLOPs and 10% of the KV cache compared to what a dense model of equivalent capability would demand. The 1,000,000-token native context window is not theoretical; it works out of the box without chunking or sliding window tricks.
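To put the KV-cache claim in perspective, here is a rough sizing sketch. The layer count, KV-head count, and head dimension below are placeholder assumptions (the real values live in the model's config files), and the formula is the standard per-layer key/value cache used by GQA-style models, not a description of how CSA/HCA actually store their compressed state; the 10% factor is simply the figure quoted above applied to that baseline.

```python
# Back-of-the-envelope KV-cache sizing at 1M tokens of context.
# n_layers, n_kv_heads, and head_dim are placeholder assumptions,
# not the published V4-Flash configuration.
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # 2x for keys and values; one entry per layer, per KV head, per token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

dense_gb = kv_cache_gb(seq_len=1_000_000, n_layers=60, n_kv_heads=8, head_dim=128)
flash_gb = 0.10 * dense_gb  # the ~10% figure quoted for the hybrid CSA + HCA stack

print(f"uncompressed-style KV cache: {dense_gb:.0f} GB")
print(f"compressed KV cache:         {flash_gb:.0f} GB")
```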
The MoE layout uses 284B total parameters with 13B active per token. For local deployment, this is the critical number: your GPU only needs to load the active expert weights plus shared attention parameters into VRAM for inference. The remaining 271B parameters sit idle on disk or in system memory, swapped in only as the router selects different experts. This is what makes running a 284B model feasible on consumer hardware—the effective memory footprint is closer to a 13B-20B dense model than a 284B one.
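A quick sketch of what the active parameters alone weigh at common quantization levels is below. The bits-per-weight values are approximate effective figures for llama.cpp-style K-quants, and this counts only the weights touched on a single forward pass; resident memory in practice also includes the router, embeddings, expert pages staged for swapping, and the KV cache, which is why the full-model figures in the table above are much larger.

```python
# Rough footprint of the per-token active weights under different quant levels.
# Bits-per-weight values are approximate; this is NOT the full VRAM requirement.
ACTIVE_PARAMS = 13e9  # 13B active parameters per token

BITS_PER_WEIGHT = {
    "FP16":   16.0,
    "Q8_0":    8.5,
    "Q5_K_M":  5.7,   # approximate effective bits for K-quants
    "Q4_K_M":  4.8,
}

for fmt, bits in BITS_PER_WEIGHT.items():
    gb = ACTIVE_PARAMS * bits / 8 / 1e9
    print(f"{fmt:>7}: ~{gb:.1f} GB of active weights per forward pass")
```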
The model also incorporates Manifold-Constrained Hyper-Connections (mHC) for improved signal propagation across layers, and was trained using the Muon optimizer. These architectural decisions contribute to the model's stability during long generations and its ability to maintain coherence across the full 1M-token context window.
V4-Flash is a text-only model that excels across the full spectrum of language tasks, but its standout strengths are reasoning, coding, and structured instruction-following.
Reasoning and math: GPQA Diamond 88.1 and LiveCodeBench 91.6 place it within striking distance of closed-source frontier models. For complex multi-step reasoning, chain-of-thought prompting, and mathematical problem-solving, this model performs at a level that was exclusive to API-only models six months ago.
Coding: The model handles full-stack development, code generation, debugging, and refactoring across major languages. Its function-calling support makes it viable for agentic workflows where the model needs to invoke tools, query databases, or orchestrate API calls. The 1M-token context is particularly useful for codebase analysis—you can feed an entire repository into context and ask for architecture reviews or migration plans.
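A minimal function-calling sketch is shown below, assuming the model is served behind an OpenAI-compatible endpoint (for example via vLLM or llama.cpp's server). The base URL, model name, and tool schema are illustrative placeholders, not values published for V4-Flash; substitute whatever your serving stack exposes.

```python
# Function-calling sketch against an assumed OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

# Hypothetical tool the model may choose to invoke during an agentic run.
tools = [{
    "type": "function",
    "function": {
        "name": "run_tests",
        "description": "Run the repository's test suite and return the failures.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string", "description": "Test directory"}},
            "required": ["path"],
        },
    },
}]

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # placeholder name; use your deployment's model id
    messages=[{"role": "user", "content": "The auth tests are failing; find the cause."}],
    tools=tools,
)

# If the model decides to call the tool, the call shows up here:
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```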
Multilingual and creative writing: The model performs strongly across Chinese, English, and other major languages. Creative writing quality is high for an