FP8 quantized version of the 12B FLUX.1 dev rectified flow transformer for lower VRAM inference.
FLUX.1 [dev] FP8 is a 12-billion parameter rectified flow transformer designed by Black Forest Labs for high-fidelity text-to-image generation. This specific version utilizes FP8 (E4M3 format) quantization to bridge the gap between the massive VRAM requirements of the full-precision model and the quality limitations of 4-bit alternatives. By leveraging reduced-precision numerics, the [dev] FP8 variant achieves approximately 2x faster inference speeds compared to the standard BF16 version while maintaining nearly identical image quality.
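To make the E4M3 format concrete, here is a minimal decoder sketch, assuming the common OCP FP8 convention (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, and a single NaN encoding per sign):

```python
def decode_e4m3(byte: int) -> float:
    """Decode one FP8 E4M3 byte (OCP convention) into a Python float."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF          # 4 exponent bits, bias 7
    mant = byte & 0x7                # 3 mantissa bits
    if exp == 0xF and mant == 0x7:
        return float("nan")          # S.1111.111 is reserved for NaN
    if exp == 0:
        return sign * (mant / 8) * 2 ** -6   # subnormal range
    return sign * (1 + mant / 8) * 2 ** (exp - 7)

# Largest finite E4M3 value: 0.1111.110 -> (1 + 6/8) * 2^8 = 448.0
print(decode_e4m3(0b0_1111_110))  # 448.0
print(decode_e4m3(0b0_0111_000))  # 1.0
```

The narrow dynamic range (max 448) is why E4M3 is used for weights and activations at inference, where values are well-bounded, rather than for gradients.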
For developers and creative engineers, this model represents the "sweet spot" for local deployment. It is optimized for the Black Forest Labs [dev] branch, which is intended for non-commercial use, research, and technical prototyping. It competes directly with other high-end open-weights diffusion models like Stable Diffusion 3 Medium, but distinguishes itself through superior prompt adherence and realistic human anatomy—specifically in complex areas like hands and legible text rendering.
The FLUX.1 [dev] FP8 architecture is built on a dense 12B parameter rectified flow transformer. Unlike traditional U-Net architectures found in older diffusion models, the transformer-based approach allows for better scaling and more nuanced understanding of long, descriptive prompts.
The shift to FP8 (8-bit floating point) is the critical technical differentiator for this version. In inference, memory bandwidth is often the primary bottleneck. By reducing the weight precision from BF16 (16-bit) to FP8, the model's memory footprint is halved from ~24GB to roughly 12GB. This allows the model to fit comfortably within the VRAM of mid-range consumer GPUs while utilizing the dedicated FP8 hardware acceleration available in modern architectures like NVIDIA’s Ada Lovelace (RTX 40-series) and Hopper.
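The footprint numbers above follow directly from the parameter count, as this back-of-envelope arithmetic shows (weights only; the VAE, text encoders, and activations are extra):

```python
# Weight-only memory footprint of a 12B-parameter transformer.
params = 12e9
gb = 1e9
bf16_gb = params * 2 / gb   # BF16: 2 bytes per weight
fp8_gb = params * 1 / gb    # FP8:  1 byte per weight
print(f"BF16: {bf16_gb:.0f} GB")  # 24 GB
print(f"FP8:  {fp8_gb:.0f} GB")   # 12 GB
```

Because inference is memory-bandwidth-bound, halving the bytes moved per weight is also where most of the ~2x speedup comes from, independent of any dedicated FP8 compute units.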
Key technical specifications include:
- Architecture: dense 12B-parameter rectified flow transformer (no U-Net)
- Quantization: FP8 (E4M3), roughly halving the BF16 memory footprint
- Weight footprint: ~12GB, down from ~24GB in BF16
- Text encoders: T5 and CLIP, plus a VAE for latent decoding
- License: [dev] branch, intended for non-commercial use, research, and prototyping
FLUX.1 [dev] FP8 is a specialized text-to-image model. It is not an LLM and does not support function calling or streaming text responses. Its primary strength lies in its ability to translate complex, multi-layered natural language descriptions into high-resolution visuals.
The model excels at generating creative concept art where stylistic consistency is required. Because it uses a 12B parameter transformer backbone, it has a deeper "world model" than smaller 2B or 3B models, allowing it to understand lighting, perspective, and material textures with high accuracy.
One of the most significant hurdles for local image models has been the inclusion of legible text. FLUX.1 [dev] FP8 handles text rendering with high reliability, making it suitable for generating assets like posters, book covers, and UI mockups where specific words must be embedded in the image.
The model is widely recognized for its ability to render human figures—particularly hands and limbs—without the common artifacts found in earlier diffusion models. This makes it a primary choice for practitioners who need realistic character design without extensive in-painting or post-processing.
To run FLUX.1 [dev] FP8 locally, your hardware strategy must prioritize VRAM capacity and memory bandwidth. While the model is optimized for FP8, you still need to account for the VRAM required by your operating system, the VAE (Variational Autoencoder), and the text encoders (typically T5 and CLIP).
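A rough VRAM budget can be sketched as follows. The component sizes here are assumptions for illustration (the transformer figure comes from the document; the encoder, VAE, and headroom figures are approximate, not measured):

```python
# Approximate VRAM budget for local FLUX.1 [dev] FP8 inference.
budget_gb = {
    "transformer (FP8)": 12.0,
    "T5 text encoder": 5.0,    # assumed size; can be offloaded to system RAM
    "CLIP text encoder": 0.3,  # assumed size
    "VAE": 0.2,                # assumed size
    "activations + OS headroom": 2.0,
}
total = sum(budget_gb.values())
print(f"total ≈ {total:.1f} GB")  # comfortable on 24 GB; tight on 16 GB
```

Under these assumptions, a 24GB card holds everything resident, while a 16GB card only fits if the T5 encoder is offloaded after prompt processing, which matches the behavior described below.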
On an RTX 4090, you can expect the model to generate a standard 1024x1024 image in 15–25 seconds depending on the step count (20-30 steps are usually sufficient for the [dev] version). On 16GB cards, performance may dip if the system has to offload the T5 text encoder to system RAM.
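As a sanity check on those figures, an assumed per-step latency of roughly 0.75 s at 1024x1024 on an RTX 4090 (a hypothetical figure, not a benchmark) reproduces the quoted range:

```python
# Estimated total generation time from an assumed per-step latency.
sec_per_step = 0.75  # hypothetical per-step time, not a measured value
for steps in (20, 30):
    print(f"{steps} steps -> ~{steps * sec_per_step:.1f} s")
# 20-30 steps lands in the quoted 15-25 second window
```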
For practitioners looking for the fastest setup, using a specialized runner is recommended:
ComfyUI: download the flux1-dev-fp8.safetensors file and the corresponding VAE. ComfyUI allows you to manage memory by offloading the T5 encoder after the initial prompt processing.

When evaluating FLUX.1 [dev] FP8 against other local models, the trade-off is almost always between resource consumption and output quality.
SD3 Medium is significantly smaller, making it easier to run on 8GB or 12GB cards. However, FLUX.1 [dev] FP8 consistently outperforms SD3 in prompt adherence and anatomical correctness. If you have the 16GB+ VRAM required, FLUX is the superior choice for professional-grade outputs.
The [schnell] version is a distilled 4-step model designed for speed. While [schnell] is faster, [dev] FP8 provides much higher detail and better composition. [schnell] is best for rapid prototyping, while [dev] FP8 is the choice for final asset generation.
The difference in visual quality between the 24GB BF16 version and the 12GB FP8 version is negligible for most use cases. Unless you are performing professional-tier fine-tuning or require the absolute maximum dynamic range for HDR workflows, the FP8 version is the more practical local deployment target due to its 2x speed advantage and lower hardware barrier to entry.