Unified image editing model that translates MLLM textual parsing into discrete editing tokens rendered by a DiT decoder. Scores 6.97 on GEdit-En, surpassing OmniGen2 and FLUX.1 Kontext.
Step1X-Edit is a unified image editing model developed by StepFun, designed to translate natural language editing instructions into precise image modifications. Unlike diffusion-based editors that rely on latent space manipulations, Step1X-Edit takes a different approach: it uses a multimodal large language model (MLLM) to parse textual instructions into discrete editing tokens, which are then rendered by a Diffusion Transformer (DiT) decoder. This architecture allows it to handle complex, multi-step edits while maintaining image fidelity.
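The two-stage flow described above can be sketched as follows. The function names and token representation here are illustrative placeholders, not StepFun's actual API:

```python
# Illustrative sketch of Step1X-Edit's two-stage pipeline.
# Function names and the token format are hypothetical stand-ins.

def parse_instruction(image, instruction):
    """Stage 1 (MLLM): parse image + instruction into discrete editing tokens."""
    # The real model runs a multimodal LLM forward pass here;
    # we return placeholder token ids for illustration.
    return [abs(hash((instruction, i))) % 32000 for i in range(4)]

def render_edit(image, edit_tokens):
    """Stage 2 (DiT): decode the editing tokens into the edited image."""
    # Placeholder for the Diffusion Transformer decoder.
    return {"source": image, "applied_tokens": edit_tokens}

def edit(image, instruction):
    tokens = parse_instruction(image, instruction)
    return render_edit(image, tokens)

result = edit("photo.png", "replace the sky with a sunset")
```

The key design point is the intermediate token stage: because the edit is expressed symbolically before rendering, multi-step instructions can be parsed once and decoded coherently.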
The model scores 6.97 on the GEdit-En benchmark, outperforming both OmniGen2 and FLUX.1 Kontext on this metric. StepFun has also released a companion benchmark, GEdit-Bench, designed to reflect real-world editing scenarios rather than synthetic test cases. The model is released under the Apache 2.0 license, making it freely available for commercial and research use.
Step1X-Edit competes directly with closed-source solutions like GPT-4o and Gemini 2 Flash for image editing tasks, but with the advantage of local execution. For practitioners evaluating open-source alternatives, this model represents a practical option for production image editing pipelines where data privacy and latency matter.
Step1X-Edit uses a dense architecture with an undisclosed parameter count. The pipeline runs in two stages: first, the MLLM component parses the input image and text instruction into discrete editing tokens that represent the desired changes; the DiT decoder then renders these tokens into the final edited image.
The model file size is approximately 27.35GB, which gives a practical indicator of its memory footprint. While the exact parameter count is not disclosed, the file size suggests a model in the 7B-12B parameter range for the MLLM component, with a separate DiT decoder. The architecture is not MoE-based, meaning all parameters are active during inference — this has implications for VRAM requirements and inference speed.
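That parameter-range inference is simple arithmetic. Assuming the checkpoint is stored in 16-bit precision (2 bytes per parameter, an assumption since StepFun does not document the storage format):

```python
# Back-of-envelope parameter estimate from the checkpoint size,
# assuming FP16/BF16 storage at 2 bytes per parameter.
file_size_bytes = 27.35e9
bytes_per_param = 2
params_billions = file_size_bytes / bytes_per_param / 1e9
print(f"~{params_billions:.1f}B total parameters (MLLM + DiT combined)")
```

A total around 13.7B parameters is consistent with a 7B-12B MLLM plus a multi-billion-parameter DiT decoder.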
Context length is not officially specified, but third-party sources report 32K tokens. This is sufficient for high-resolution image editing tasks where the model needs to process both the full image and detailed instructions. The model supports English language input natively.
For local deployment, the architecture's dense nature means you cannot trade quality for speed by reducing the number of active parameters, as you could with an MoE model. However, the separate decoder design does allow for optimization: the MLLM and DiT components can be quantized independently.
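A mixed-precision loading plan could exploit that separation. Everything below is a hypothetical sketch of the idea, not a real loading API:

```python
# Hypothetical mixed-precision plan exploiting the two-module design:
# quantize the MLLM aggressively, keep the more quantization-sensitive
# DiT decoder at higher precision. `load_component` is a stand-in.

def load_component(name, bits):
    # A real implementation would load and quantize the module's weights.
    return {"name": name, "bits": bits}

pipeline = {
    "mllm": load_component("mllm", bits=4),  # 4-bit: tolerates quantization well
    "dit":  load_component("dit",  bits=8),  # 8-bit: preserves decoder fidelity
}
```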
Step1X-Edit is an instruction-driven image editing model, not a text-to-image generator. It takes an existing image and an editing instruction, then produces a modified version, which makes it suited to workflows that modify existing images rather than synthesize new ones.
The model's benchmark performance on GEdit-En (6.97) indicates strong capability in following detailed editing instructions accurately. The GEdit-Bench benchmark focuses on real-world editing scenarios, which suggests the model generalizes well beyond synthetic test cases.
For developers, this means Step1X-Edit can be integrated into automated image processing pipelines, content creation workflows, or any application requiring programmatic image editing with natural language control.
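Pipeline integration might look like the loop below, with `edit_image` standing in for whatever inference entry point the official repository actually exposes:

```python
# Batch-editing sketch; `edit_image` is a placeholder for the real
# Step1X-Edit inference call.

def edit_image(path, instruction):
    # A real implementation would load the image, run the model,
    # and write the edited result to disk.
    return {"input": path, "instruction": instruction, "status": "edited"}

jobs = [
    ("product_01.png", "remove the background"),
    ("product_02.png", "change the shirt color to navy"),
]
results = [edit_image(path, text) for path, text in jobs]
```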
Because the architecture is dense, VRAM requirements scale directly with the 27.35GB model file size and the quantization level you choose.
The DiT decoder component is more sensitive to quantization than the MLLM. For most users, Q4_K_M offers the best quality-to-performance ratio. Q8_0 provides near-lossless quality at the cost of higher VRAM. FP16 is only recommended if you have 32GB+ VRAM and need maximum fidelity.
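Assuming roughly 13.7B total parameters (the 27.35GB checkpoint at 2 bytes per weight) and llama.cpp-style effective bit-widths for these formats, both of which are assumptions, weight memory alone works out roughly as follows:

```python
# Rough weight-only VRAM per quantization level. Assumes ~13.7B total
# parameters and llama.cpp-style effective bits per weight; activations
# and intermediate buffers add several GB on top in practice.
PARAMS = 13.7e9
EFFECTIVE_BITS = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

def weight_vram_gb(quant):
    return PARAMS * EFFECTIVE_BITS[quant] / 8 / 1e9

for quant in EFFECTIVE_BITS:
    print(f"{quant}: ~{weight_vram_gb(quant):.1f} GB for weights alone")
```

Under these assumptions, Q4_K_M lands near 8GB of weights, Q8_0 near 15GB, and FP16 near 27GB, which is why FP16 only makes sense with 32GB+ of VRAM.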
At Q4_K_M on an RTX 4090, throughput depends primarily on output resolution and sampling step count. The RegionE optimization (released December 2025) delivers a 2.5x speedup with no accuracy degradation, requiring only about five lines of code changes. This makes it practical for production deployment where throughput matters.
The fastest path to local deployment is through the official inference code on GitHub. The model supports transformers library integration. For developers wanting an API-like interface, the Gradio app included in the repository provides a web UI for testing.
Step1X-Edit vs. OmniGen2: Step1X-Edit outperforms OmniGen2 on GEdit-En (6.97), particularly on complex editing instructions. OmniGen2 is a general-purpose multimodal model, while Step1X-Edit is specialized for editing, which gives it an edge in edit precision. However, OmniGen2 offers broader capabilities beyond editing.
Step1X-Edit vs. FLUX.1 Kontext: FLUX.1 Kontext scores 6.51 on GEdit-En, placing it below Step1X-Edit. FLUX is stronger as a pure image generation model, but Step1X-Edit's architecture is specifically optimized for editing existing images. If your workflow is primarily generation with occasional edits, FLUX may be more suitable. If editing is the primary task, Step1X-Edit is the better choice.
Step1X-Edit vs. closed-source alternatives: The model aims to match GPT-4o and Gemini 2 Flash on image editing tasks. While exact comparisons are difficult without standardized benchmarks, Step1X-Edit's open-source nature and Apache 2.0 license make it the only option for local, private deployment at this capability level.