Unified image editing model that translates MLLM textual parsing into discrete editing tokens rendered by a DiT decoder. Scores 6.97 on GEdit-En, surpassing OmniGen2 and FLUX.1 Kontext.
Step1X-Edit is a unified image editing model developed by StepFun, designed to translate natural language editing instructions into precise image modifications. Unlike diffusion-based editors that rely on latent space manipulations, Step1X-Edit takes a different approach: it uses a multimodal large language model (MLLM) to parse textual instructions into discrete editing tokens, which are then rendered by a Diffusion Transformer (DiT) decoder. This architecture allows it to handle complex, multi-step edits while maintaining image fidelity.
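The two-stage flow described above can be sketched as follows. The function names and token representation here are illustrative placeholders, not StepFun's actual API:

```python
# Illustrative sketch of Step1X-Edit's two-stage pipeline.
# Function names and the token format are hypothetical stand-ins.

def parse_instruction(image, instruction):
    """Stage 1 (MLLM): parse image + instruction into discrete editing tokens."""
    # The real model runs a multimodal LLM forward pass here;
    # we return placeholder token ids for illustration.
    return [abs(hash((instruction, i))) % 32000 for i in range(4)]

def render_edit(image, edit_tokens):
    """Stage 2 (DiT): decode the editing tokens into the edited image."""
    # Placeholder for the Diffusion Transformer decoder.
    return {"source": image, "applied_tokens": edit_tokens}

def edit(image, instruction):
    tokens = parse_instruction(image, instruction)
    return render_edit(image, tokens)

result = edit("photo.png", "replace the sky with a sunset")
```

The key design point is the intermediate token stage: because the edit is expressed symbolically before rendering, multi-step instructions can be parsed once and decoded coherently.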
The model scores 6.97 on the GEdit-En benchmark, outperforming both OmniGen2 and FLUX.1 Kontext on this metric. StepFun has also released a companion benchmark, GEdit-Bench, designed to reflect real-world editing scenarios rather than synthetic test cases. The model is released under the Apache 2.0 license, making it freely available for commercial and research use.
Step1X-Edit competes directly with closed-source solutions like GPT-4o and Gemini 2 Flash for image editing tasks, but with the advantage of local execution. For practitioners evaluating open-source alternatives, this model represents a practical option for production image editing pipelines where data privacy and latency matter.
Step1X-Edit uses a dense architecture with an undisclosed parameter count. The pipeline runs in two stages: first, the MLLM component parses the input image and text instruction into discrete editing tokens that represent the desired changes; the DiT decoder then renders these tokens into the final edited image.
The model file size is approximately 27.35GB, which gives a practical indicator of its memory footprint. While the exact parameter count is not disclosed, the file size suggests a model in the 7B-12B parameter range for the MLLM component, with a separate DiT decoder. The architecture is not MoE-based, meaning all parameters are active during inference — this has implications for VRAM requirements and inference speed.
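That parameter-range inference is simple arithmetic. Assuming the checkpoint is stored in 16-bit precision (2 bytes per parameter, an assumption since StepFun does not document the storage format):

```python
# Back-of-envelope parameter estimate from the checkpoint size,
# assuming FP16/BF16 storage at 2 bytes per parameter.
file_size_bytes = 27.35e9
bytes_per_param = 2
params_billions = file_size_bytes / bytes_per_param / 1e9
print(f"~{params_billions:.1f}B total parameters (MLLM + DiT combined)")
```

A total around 13.7B parameters is consistent with a 7B-12B MLLM plus a multi-billion-parameter DiT decoder.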
Context length is not officially specified, but third-party sources report 32K tokens. This is sufficient for high-resolution image editing tasks where the model needs to process both the full image and detailed instructions. The model supports English language input natively.
For local deployment, the architecture's dense nature means you cannot trade quality for speed by reducing the number of active parameters, as you could with an MoE model. However, the separate decoder design does allow for optimization: the MLLM and DiT components can be quantized independently.
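A mixed-precision loading plan could exploit that separation. Everything below is a hypothetical sketch of the idea, not a real loading API:

```python
# Hypothetical mixed-precision plan exploiting the two-module design:
# quantize the MLLM aggressively, keep the more quantization-sensitive
# DiT decoder at higher precision. `load_component` is a stand-in.

def load_component(name, bits):
    # A real implementation would load and quantize the module's weights.
    return {"name": name, "bits": bits}

pipeline = {
    "mllm": load_component("mllm", bits=4),  # 4-bit: tolerates quantization well
    "dit":  load_component("dit",  bits=8),  # 8-bit: preserves decoder fidelity
}
```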
Step1X-Edit is an instruction-driven image editing model, not a text-to-image generator. It takes an existing image and an editing instruction, then produces a modified version, which makes it suited to workflows that modify existing images rather than synthesize new ones.
The model's benchmark performance on GEdit-En (6.97) indicates strong capability in following detailed editing instructions accurately. The GEdit-Bench benchmark focuses on real-world editing scenarios, which suggests the model generalizes well beyond synthetic test cases.
For developers, this means Step1X-Edit can be integrated into automated image processing pipelines, content creation workflows, or any application requiring programmatic image editing with natural language control.
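Pipeline integration might look like the loop below, with `edit_image` standing in for whatever inference entry point the official repository actually exposes:

```python
# Batch-editing sketch; `edit_image` is a placeholder for the real
# Step1X-Edit inference call.

def edit_image(path, instruction):
    # A real implementation would load the image, run the model,
    # and write the edited result to disk.
    return {"input": path, "instruction": instruction, "status": "edited"}

jobs = [
    ("product_01.png", "remove the background"),
    ("product_02.png", "change the shirt color to navy"),
]
results = [edit_image(path, text) for path, text in jobs]
```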
Because the architecture is dense, VRAM requirements scale directly with the 27.35GB model file size and the quantization level you choose.
The DiT decoder component is more sensitive to quantization than the MLLM. For most users, Q4_K_M offers the best quality-to-performance ratio. Q8_0 provides near-lossless quality at the cost of higher VRAM. FP16 is only recommended if you have 32GB+ VRAM and need maximum fidelity.
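Assuming roughly 13.7B total parameters (the 27.35GB checkpoint at 2 bytes per weight) and llama.cpp-style effective bit-widths for these formats, both of which are assumptions, weight memory alone works out roughly as follows:

```python
# Rough weight-only VRAM per quantization level. Assumes ~13.7B total
# parameters and llama.cpp-style effective bits per weight; activations
# and intermediate buffers add several GB on top in practice.
PARAMS = 13.7e9
EFFECTIVE_BITS = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

def weight_vram_gb(quant):
    return PARAMS * EFFECTIVE_BITS[quant] / 8 / 1e9

for quant in EFFECTIVE_BITS:
    print(f"{quant}: ~{weight_vram_gb(quant):.1f} GB for weights alone")
```

Under these assumptions, Q4_K_M lands near 8GB of weights, Q8_0 near 15GB, and FP16 near 27GB, which is why FP16 only makes sense with 32GB+ of VRAM.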
At Q4_K_M on an RTX 4090, throughput depends primarily on output resolution and sampling step count. The RegionE optimization (released December 2025) delivers a 2.5x speedup with no accuracy degradation, requiring only about five lines of code changes. This makes it practical for production deployment where throughput matters.
The fastest path to local deployment is through the official inference code on GitHub. The model supports transformers library integration. For developers wanting an API-like interface, the Gradio app included in the repository provides a web UI for testing.
Step1X-Edit vs. OmniGen2: Step1X-Edit outperforms OmniGen2 on GEdit-En (6.97), particularly on complex editing instructions. OmniGen2 is a general-purpose multimodal model, while Step1X-Edit is specialized for editing, which gives it an edge in edit precision. However, OmniGen2 offers broader capabilities beyond editing.
Step1X-Edit vs. FLUX.1 Kontext: FLUX.1 Kontext scores 6.51 on GEdit-En, placing it below Step1X-Edit. FLUX is stronger as a pure image generation model, but Step1X-Edit's architecture is specifically optimized for editing existing images. If your workflow is primarily generation with occasional edits, FLUX may be more suitable. If editing is the primary task, Step1X-Edit is the better choice.
Step1X-Edit vs. closed-source alternatives: The model aims to match GPT-4o and Gemini 2 Flash on image editing tasks. While exact comparisons are difficult without standardized benchmarks, Step1X-Edit's open-source nature and Apache 2.0 license make it the only option for local, private deployment at this capability level.