
20B image editing model built on Qwen-Image with Qwen2.5-VL for semantic control and a VAE encoder for appearance modulation. Native bilingual (Chinese/English) text editing.
Qwen-Image-Edit is a 20B parameter dense image editing model from Alibaba's Qwen team, released under Apache 2.0. It extends Qwen-Image's text rendering capabilities into image editing, enabling both semantic transformations and pixel-level appearance modifications through natural language instructions. The model takes an input image plus a plain-text instruction and operates as an instruction-based editor: you describe what to change, and it applies the edit.
This model competes with other large-scale instruction-based image editors in the 10B-30B parameter range. What distinguishes Qwen-Image-Edit is its dual-path architecture that separates semantic understanding from appearance control, and its native bilingual support for Chinese and English text editing. For practitioners who need to edit images programmatically without cloud APIs, this is a capable self-hosted option.
Qwen-Image-Edit uses a Multimodal Diffusion Transformer (MMDiT) backbone with 20B dense parameters. The architecture processes input images through two parallel pathways: Qwen2.5-VL extracts high-level semantic features for content understanding, while a VAE encoder captures low-level appearance features for pixel-accurate control.
These two streams feed into the MMDiT backbone alongside the text prompt, allowing the model to distinguish between edits that should preserve exact pixels (appearance editing) versus edits that can change pixels while maintaining semantic meaning (semantic editing).
The model uses a Diffusers pipeline for inference and supports bfloat16 precision. Context length is not specified, but the model processes single image-text pairs per inference step rather than long sequences. The 20B parameter count means all weights are active during inference — there are no expert routing optimizations. This makes VRAM consumption predictable: the full model in bfloat16 requires approximately 40GB of VRAM before optimizations.
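As a rough sanity check on that figure, weight memory scales linearly with parameter count and bytes per parameter; activations, the Qwen2.5-VL text encoder, and the VAE add overhead on top of the raw weights:

```python
# Back-of-the-envelope weight memory for a 20B-parameter dense model.
# Real usage is higher: activations, attention buffers, the text encoder,
# and the VAE all add overhead on top of the raw weights.
PARAMS = 20e9

for label, bytes_per_param in [("bfloat16", 2), ("int8", 1), ("4-bit", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{label:>8}: ~{gib:.0f} GiB of weights")

# bfloat16: ~37 GiB, int8: ~19 GiB, 4-bit: ~9 GiB
```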
Qwen-Image-Edit supports three categories of image editing, each with distinct behavior:
Semantic Editing modifies high-level content while preserving character or subject identity. Examples include style transfer (photo to anime, oil painting), object rotation with physically accurate perspective, scene transformation (changing seasons or time of day), and character variations for IP creation. The model can change most pixels in the image while keeping the subject recognizable.
Appearance Editing makes precise visual modifications while leaving all other pixels unchanged. This covers adding or removing elements, adjusting colors, replacing backgrounds, and modifying specific objects. The model preserves everything outside the edited region exactly.
Text Editing handles bilingual Chinese and English text within images. You can add, delete, or rewrite text while preserving the original font, size, color, and style. This is inherited from Qwen-Image and is the model's standout capability — most image editors struggle with coherent text rendering.
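To make the three categories concrete, instruction prompts might look like the following; the wording is illustrative rather than taken from the model card:

```python
# Hypothetical instructions for each editing category.
EDIT_PROMPTS = {
    # Semantic: most pixels may change, subject identity is preserved.
    "semantic": "Turn this photo into a watercolor illustration, keep the person's face recognizable",
    # Appearance: only the named region changes, everything else stays pixel-identical.
    "appearance": "Remove the coffee cup from the table and leave the rest of the scene untouched",
    # Text: bilingual add/replace/delete while matching the original typography.
    "text": "Replace the sign text 'OPEN' with '营业中', keeping the original font, size, and color",
}
```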
Concrete use cases: correcting text in generated images, localizing marketing materials between Chinese and English, creating consistent character variations for games or comics, batch-applying style transfers to product shots, and programmatic image editing pipelines that need deterministic local execution.
Qwen-Image-Edit requires significant GPU memory. Here are the hardware requirements for different configurations:
Minimum hardware (quantized): 24GB VRAM (RTX 4090, RTX 3090). Expect to use 4-bit quantization (Q4_K_M or similar) to fit the model. Inference will be noticeably slower than at full precision; see the consumer GPU estimate below for realistic per-image timings.
Recommended hardware (full precision): 48GB+ VRAM (A6000, A100, dual RTX 4090). Running in bfloat16 without quantization gives the best quality. Expect 2-5 seconds per image at 50 inference steps.
Consumer GPU realistic: An RTX 4090 with 24GB VRAM can run the Q4_K_M quantized version. Use torch.bfloat16 and set num_inference_steps to 30-50. For a 20B model at this quantization level, expect roughly 1-2 seconds per inference step, or about 30-100 seconds per image in total. Resolution also matters: 512x512 is noticeably faster than 1024x1024.
Quickest way to start: Use the Hugging Face diffusers pipeline as shown in the official documentation. Install the latest diffusers from source (pip install git+https://github.com/huggingface/diffusers), load Qwen/Qwen-Image-Edit with QwenImageEditPipeline.from_pretrained(), and move the pipeline to CUDA. No Ollama support is currently available for this model.
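A minimal sketch of that flow is below. It follows the pipeline usage published with the model; the file names and prompt are placeholders, and argument names such as true_cfg_scale should be checked against the diffusers version you install:

```python
import torch
from PIL import Image
from diffusers import QwenImageEditPipeline

# Load the pipeline in bfloat16 and move it to the GPU (~40GB of VRAM unquantized).
pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

image = Image.open("input.png").convert("RGB")  # placeholder input image

output = pipeline(
    image=image,
    prompt="Change the storefront sign to read 'OPEN 24 HOURS', keep the original font",
    negative_prompt=" ",
    num_inference_steps=50,
    true_cfg_scale=4.0,              # guidance strength from the reference example
    generator=torch.manual_seed(0),  # fixed seed for reproducible edits
)
output.images[0].save("edited.png")
```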
Quantization recommendations: Q4_K_M offers the best quality-to-speed ratio for 24GB cards. GGUF conversions are not yet officially released, but community quantizations may appear. If you have 48GB+, run bfloat16 for maximum fidelity.
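Until official GGUF builds land, one way to squeeze the model onto a 24GB card is on-the-fly 4-bit quantization of the transformer with bitsandbytes plus CPU offload. This is a hedged sketch: it assumes diffusers' BitsAndBytesConfig and the QwenImageTransformer2DModel class apply to this checkpoint, so verify the class and subfolder names against the current diffusers documentation:

```python
import torch
from diffusers import BitsAndBytesConfig, QwenImageEditPipeline, QwenImageTransformer2DModel

# NF4 4-bit quantization of the 20B transformer (assumed class name, see lead-in).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = QwenImageTransformer2DModel.from_pretrained(
    "Qwen/Qwen-Image-Edit",
    subfolder="transformer",          # assumed repository layout
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
# Keep idle components (text encoder, VAE) in system RAM to stay under 24GB.
pipeline.enable_model_cpu_offload()
```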
vs. InstructPix2Pix (1.5B): InstructPix2Pix is significantly smaller and runs on 8GB GPUs, but its editing quality is limited. Qwen-Image-Edit produces more coherent results, handles text properly, and supports bilingual editing. Choose InstructPix2Pix if you have limited VRAM and need basic edits. Choose Qwen-Image-Edit for production-quality results and text manipulation.
vs. FLUX.1-dev (12B): FLUX is smaller and faster, with strong general image generation. However, FLUX lacks dedicated image editing capabilities — you would need to inpaint or regenerate. Qwen-Image-Edit is purpose-built for editing, with explicit support for semantic and appearance control. FLUX is better for generation; Qwen-Image-Edit is better for modifying existing images.
The tradeoff is VRAM. At 20B dense parameters, Qwen-Image-Edit requires more hardware than smaller models. If you have the GPU memory, it delivers superior editing fidelity and text handling. If you are constrained to 12GB or less, look at smaller alternatives.
