12B parameter rectified flow transformer optimized for iterative, context-aware editing. Dual image-text input with strong character consistency across scenes; TensorRT-optimized for Blackwell.
FLUX.1 Kontext [dev] is a 12B parameter rectified flow transformer purpose-built for high-fidelity image editing. Developed by Black Forest Labs, this model bridges the gap between text-to-image generation and precise manipulation, offering a local, open-weight alternative to proprietary editing tools. Unlike standard diffusion models that struggle with maintaining identity or scene consistency during modifications, Kontext is optimized for iterative, context-aware editing and character preservation across varying environments.
As the developer-tier version of the Kontext family, it provides the same architectural backbone as the [pro] version but is licensed for non-commercial research and local development. It occupies a unique niche in the 12B parameter space, specifically targeting practitioners who need to perform complex global or local edits—such as changing a character's clothing, altering background elements, or adjusting lighting—without losing the structural integrity of the original subject.
The model utilizes a dense 12B parameter architecture based on the rectified flow transformer framework. This design is inherited from the original FLUX.1 [dev] lineage but is fine-tuned specifically for dual image-text inputs. Where standard models take a text prompt and noise to generate an image, Kontext accepts an existing image as a foundational "context" alongside text instructions, allowing for a more directed denoising process.
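This dual-input flow can be sketched with Hugging Face Diffusers. The sketch below assumes a recent Diffusers release that ships a `FluxKontextPipeline` class and the `black-forest-labs/FLUX.1-Kontext-dev` repository id; verify both against your installed version before relying on them.

```python
# Minimal sketch of dual image-text conditioning: an existing image is passed
# as the denoising context alongside a text instruction describing the edit.

def build_edit_kwargs(prompt: str, image, steps: int = 28,
                      guidance: float = 2.5) -> dict:
    """Bundle the dual inputs expected by the pipeline call."""
    return {
        "prompt": prompt,              # text instruction for the edit
        "image": image,                # context image guiding the denoising
        "num_inference_steps": steps,  # assumed defaults, tune to taste
        "guidance_scale": guidance,
    }

def run_edit(input_path: str, prompt: str):
    # Heavy imports are kept inside the function so the sketch stays
    # importable on machines without a GPU or the model weights.
    import torch
    from diffusers import FluxKontextPipeline
    from PIL import Image

    pipe = FluxKontextPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
    ).to("cuda")
    kwargs = build_edit_kwargs(prompt, Image.open(input_path))
    return pipe(**kwargs).images[0]
```

`run_edit("photo.png", "change the jacket to red leather")` would then return the edited image while leaving the rest of the scene intact.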
A significant technical highlight is the model's optimization for the NVIDIA Blackwell architecture. Black Forest Labs collaborated with NVIDIA to produce TensorRT-optimized weights, including BF16, FP8, and even FP4 variants. These optimizations allow the model to leverage Blackwell’s specific hardware accelerators, significantly reducing memory overhead and increasing inference speed. For users on standard consumer hardware, the model remains compatible with standard FLUX.1 inference pipelines, including Hugging Face Diffusers and ComfyUI.
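The practical impact of those precision variants is easy to estimate from the parameter count alone. A back-of-the-envelope sketch, counting weights only (activations, attention buffers, and the text encoders add further overhead):

```python
# Approximate weight footprint of a 12B parameter model at the precisions
# mentioned above. Weights only; real VRAM usage will be higher.

PARAMS = 12e9  # 12 billion parameters

def weight_gb(bits_per_param: float) -> float:
    """Weight storage in decimal gigabytes for a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("BF16", 16), ("FP8", 8), ("FP4", 4)]:
    print(f"{name}: ~{weight_gb(bits):.0f} GB")
# BF16 ~24 GB, FP8 ~12 GB, FP4 ~6 GB
```

The halving at each step is why the FP8 and FP4 variants matter so much on consumer cards: they move the weights from "barely fits" to "fits with headroom".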
FLUX.1 Kontext [dev] is not a general-purpose text-to-image generator; it is a specialized tool for image-to-image manipulation. Its primary strength lies in its ability to maintain "character consistency," a notoriously difficult task for open-weight models.
The model excels at both surgical edits (changing a specific object within a frame) and global style transfers. Because it understands the spatial context of the input image, it can add, remove, or modify elements while ensuring the new pixels blend naturally with the existing lighting and perspective.
Practitioners can use Kontext for multi-stage editing. For example, a developer can generate a base character, then use Kontext in subsequent passes to change the setting from a forest to a cityscape, then change the character's expression, all while keeping the character's face and proportions identical.
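The multi-stage workflow above amounts to a loop that feeds each output back in as the next context image. In this sketch, `apply_edit` is a hypothetical placeholder for whatever inference call (Diffusers, ComfyUI, a custom script) actually runs the model:

```python
# Sketch of iterative, context-aware editing: each pass reuses the previous
# output as the new context image, which is what lets the character's face
# and proportions carry across edits.

def apply_edit(image: str, instruction: str) -> str:
    # Stub: a real implementation would invoke FLUX.1 Kontext [dev] here.
    return f"{image} + [{instruction}]"

def multi_stage_edit(base_image: str, instructions: list[str]) -> list[str]:
    """Apply edits sequentially, returning every intermediate result."""
    history = [base_image]
    current = base_image
    for instruction in instructions:
        current = apply_edit(current, instruction)  # output -> next context
        history.append(current)
    return history

stages = multi_stage_edit(
    "base_character.png",
    ["move the scene from a forest to a cityscape",
     "make the character smile"],
)
```

Keeping the intermediate results in `history` makes it cheap to branch: rerun the loop from any earlier stage with a different instruction list.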
While the [dev] weights are under a non-commercial license, they serve as a perfect local sandbox for developers building apps that will eventually scale to the [pro] API, enabling complex ComfyUI workflows or custom inference scripts to be developed without incurring cloud API costs during the R&D phase.
To run FLUX.1 Kontext [dev] locally, your primary bottleneck will be VRAM. The model weights alone are substantial, and the bidirectional attention mechanisms inherent in the transformer architecture require significant memory headroom during inference.
For most practitioners, the FP8 scaled version is the best balance of quality and speed. The full BF16 weights (approx. 24GB) are often too large for a single consumer GPU once you account for the VRAM required by the UI and operating system.
On an RTX 4090 using FP8 weights, a standard 20-30 step edit can typically be completed in under 40 seconds. On older hardware or with aggressive CPU offloading, expect significantly longer waits, with a single edit often exceeding 50 seconds.
FLUX.1 Kontext [dev] enters a market previously dominated by models like ByteDance's Bagel or HiDream-E1.
HiDream-E1 has been a popular choice for multimodal editing, but practitioners often report issues with consistency and "hallucinated" artifacts during complex edits. Kontext [dev] generally demonstrates superior character preservation and a more sophisticated understanding of global lighting.
While Gemini-Flash is a closed-source, API-only model, Kontext [dev] provides comparable, and in some benchmarks superior, editing precision. Kontext's primary advantages are freedom from the "censorship" and "guardrail" interference that often constrains proprietary models, alongside the latency benefit of local execution with no network round-trips.
When choosing between Kontext and a standard FLUX.1 [dev] model with a ControlNet, Kontext is usually the better choice for pure editing. While ControlNets can guide generation, they often struggle with the "contextual" part of the edit—ensuring the new elements feel lived-in and stylistically matched to the original image. Kontext handles this natively within the 12B parameter transformer.