
20B image editing model built on Qwen-Image with Qwen2.5-VL for semantic control and a VAE encoder for appearance modulation. Native bilingual (Chinese/English) text editing.
Qwen-Image-Edit is a 20B parameter dense image editing model from Alibaba's Qwen team, released under Apache 2.0. It extends Qwen-Image's text rendering capabilities into image editing, enabling both semantic transformations and pixel-level appearance modifications through natural language instructions. The model takes an input image plus a plain-text instruction and operates as an instruction-based editor: you describe what to change, and it applies the edit.
This model competes with other large-scale instruction-based image editors in the 10B-30B parameter range. What distinguishes Qwen-Image-Edit is its dual-path architecture that separates semantic understanding from appearance control, and its native bilingual support for Chinese and English text editing. For practitioners who need to edit images programmatically without cloud APIs, this is a capable self-hosted option.
Qwen-Image-Edit uses a Multimodal Diffusion Transformer (MMDiT) backbone with 20B dense parameters. The architecture processes input images through two parallel pathways: Qwen2.5-VL extracts high-level semantic features for content understanding, while a VAE encoder captures low-level appearance features for pixel-accurate control.
These two streams feed into the MMDiT backbone alongside the text prompt, allowing the model to distinguish between edits that should preserve exact pixels (appearance editing) versus edits that can change pixels while maintaining semantic meaning (semantic editing).
The model uses a Diffusers pipeline for inference and supports bfloat16 precision. Context length is not specified, but the model processes single image-text pairs per inference step rather than long sequences. The 20B parameter count means all weights are active during inference — there are no expert routing optimizations. This makes VRAM consumption predictable: the full model in bfloat16 requires approximately 40GB of VRAM before optimizations.
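As a rough sanity check on that figure, weight memory scales linearly with parameter count and bytes per parameter; activations, the Qwen2.5-VL text encoder, and the VAE add overhead on top of the raw weights:

```python
# Back-of-the-envelope weight memory for a 20B-parameter dense model.
# Real usage is higher: activations, attention buffers, the text encoder,
# and the VAE all add overhead on top of the raw weights.
PARAMS = 20e9

for label, bytes_per_param in [("bfloat16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{label:>8}: ~{gib:.0f} GiB of weights")

# bfloat16: ~37 GiB, int8: ~19 GiB, 4-bit: ~9 GiB
```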
Qwen-Image-Edit supports three categories of image editing, each with distinct behavior:
Semantic Editing modifies high-level content while preserving character or subject identity. Examples include style transfer (photo to anime, oil painting), object rotation with physically accurate perspective, scene transformation (changing seasons or time of day), and character variations for IP creation. The model can change most pixels in the image while keeping the subject recognizable.
Appearance Editing makes precise visual modifications while leaving all other pixels unchanged. This covers adding or removing elements, adjusting colors, replacing backgrounds, and modifying specific objects. The model preserves everything outside the edited region exactly.
Text Editing handles bilingual Chinese and English text within images. You can add, delete, or rewrite text while preserving the original font, size, color, and style. This is inherited from Qwen-Image and is the model's standout capability — most image editors struggle with coherent text rendering.
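To make the three categories concrete, instruction prompts might look like the following; the wording is illustrative rather than taken from the model card:

```python
# Hypothetical instructions for each editing category.
EDIT_PROMPTS = {
    # Semantic: most pixels may change, subject identity is preserved.
    "semantic": "Turn this photo into a watercolor illustration, keep the person's face recognizable",
    # Appearance: only the named region changes, everything else stays pixel-identical.
    "appearance": "Remove the coffee cup from the table and leave the rest of the scene untouched",
    # Text: bilingual add/replace/delete while matching the original typography.
    "text": "Replace the sign text 'OPEN' with '营业中', keeping the original font, size, and color",
}
```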
Concrete use cases: correcting text in generated images, localizing marketing materials between Chinese and English, creating consistent character variations for games or comics, batch-applying style transfers to product shots, and programmatic image editing pipelines that need deterministic local execution.
Qwen-Image-Edit requires significant GPU memory. Here are the hardware requirements for different configurations:
Minimum hardware (quantized): 24GB VRAM (RTX 4090, RTX 3090). Expect to use 4-bit quantization (Q4_K_M or similar) to fit the model. Inference will be noticeably slower than at full precision; see the consumer GPU estimate below for realistic per-image timings.
Recommended hardware (full precision): 48GB+ VRAM (A6000, A100, dual RTX 4090). Running in bfloat16 without quantization gives the best quality. Expect 2-5 seconds per image at 50 inference steps.
Consumer GPU realistic: An RTX 4090 with 24GB VRAM can run the Q4_K_M quantized version. Use torch.bfloat16 and set num_inference_steps to 30-50. For a 20B model at this quantization level, expect roughly 1-2 seconds per inference step, or about 30-100 seconds per image in total. Resolution also matters: 512x512 is noticeably faster than 1024x1024.
Quickest way to start: Use the Hugging Face diffusers pipeline as shown in the official documentation. Install the latest diffusers from source (pip install git+https://github.com/huggingface/diffusers), load Qwen/Qwen-Image-Edit with QwenImageEditPipeline.from_pretrained(), and move the pipeline to CUDA. No Ollama support is currently available for this model.
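A minimal sketch of that flow is below. It follows the pipeline usage published with the model; the file names and prompt are placeholders, and argument names such as true_cfg_scale should be checked against the diffusers version you install:

```python
import torch
from PIL import Image
from diffusers import QwenImageEditPipeline

# Load the pipeline in bfloat16 and move it to the GPU (~40GB of VRAM unquantized).
pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

image = Image.open("input.png").convert("RGB")  # placeholder input image

output = pipeline(
    image=image,
    prompt="Change the storefront sign to read 'OPEN 24 HOURS', keep the original font",
    negative_prompt=" ",
    num_inference_steps=50,
    true_cfg_scale=4.0,              # guidance strength from the reference example
    generator=torch.manual_seed(0),  # fixed seed for reproducible edits
)
output.images[0].save("edited.png")
```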
Quantization recommendations: Q4_K_M offers the best quality-to-speed ratio for 24GB cards. GGUF conversions are not yet officially released, but community quantizations may appear. If you have 48GB+, run bfloat16 for maximum fidelity.
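Until official GGUF builds land, one way to squeeze the model onto a 24GB card is on-the-fly 4-bit quantization of the transformer with bitsandbytes plus CPU offload. This is a hedged sketch: it assumes diffusers' BitsAndBytesConfig and the QwenImageTransformer2DModel class apply to this checkpoint, so verify the class and subfolder names against the current diffusers documentation:

```python
import torch
from diffusers import BitsAndBytesConfig, QwenImageEditPipeline, QwenImageTransformer2DModel

# NF4 4-bit quantization of the 20B transformer (assumed class name, see lead-in).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/Qwen-Image-Edit",
    subfolder="transformer",          # assumed repository layout
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Keep idle components (text encoder, VAE) in system RAM to stay under 24GB.
pipeline.enable_model_cpu_offload()
```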
vs. InstructPix2Pix (1.5B): InstructPix2Pix is significantly smaller and runs on 8GB GPUs, but its editing quality is limited. Qwen-Image-Edit produces more coherent results, handles text properly, and supports bilingual editing. Choose InstructPix2Pix if you have limited VRAM and need basic edits. Choose Qwen-Image-Edit for production-quality results and text manipulation.
vs. FLUX.1-dev (12B): FLUX is smaller and faster, with strong general image generation. However, FLUX lacks dedicated image editing capabilities — you would need to inpaint or regenerate. Qwen-Image-Edit is purpose-built for editing, with explicit support for semantic and appearance control. FLUX is better for generation; Qwen-Image-Edit is better for modifying existing images.
The tradeoff is VRAM. At 20B dense parameters, Qwen-Image-Edit requires more hardware than smaller models. If you have the GPU memory, it delivers superior editing fidelity and text handling. If you are constrained to 12GB or less, look at smaller alternatives.
