
Base 20B image generation model from Alibaba. Foundation for the Qwen-Image editing and prompt-extend ecosystem.
Qwen-Image is a 20 billion parameter dense image generation model from Alibaba’s Qwen team. It was released under the Apache 2.0 license and serves as the foundation model for the broader Qwen-Image ecosystem, which includes editing-specific variants like Qwen-Image-Edit and prompt-extension tools. The model is purpose-built for generating images from text prompts, with a specific focus on two hard problems: rendering legible text inside generated images and performing consistent, semantically aware image edits.
The model competes directly with other open-weight image generation models such as FLUX.1-dev (12B) and SD3.5 (8B). What distinguishes Qwen-Image is its demonstrated proficiency in multi-line text rendering across both alphabetic scripts like English and logographic scripts like Chinese, as well as its ability to preserve visual fidelity during editing operations. This makes it a practical choice for practitioners who need to generate assets with embedded text (posters, infographics, signage, or UI mockups) without relying on post-processing or external text-overlay steps.
Qwen-Image uses a dense architecture with 20 billion parameters, meaning all parameters are active during every forward pass. This is in contrast to Mixture-of-Experts (MoE) models, where only a subset of parameters activates per token. For inference hardware planning, this is a critical distinction: a dense 20B model has a fixed weight footprint in memory (activation memory still scales with output resolution), and every denoising step consumes the full compute budget.
The architecture is based on MMDiT (Multimodal Diffusion Transformer), a design that processes text and image latents jointly within the transformer backbone. This joint processing is what enables Qwen-Image’s text rendering capabilities — the model can align character-level token representations with spatial positions in the generated image, rather than treating text as a separate overlay step.
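To make the joint-processing idea concrete, here is a minimal PyTorch sketch of MMDiT-style joint attention: each modality keeps its own projections, but attention runs over the concatenated text and image token sequence. Dimensions and names are illustrative, not Qwen-Image’s actual implementation.

```python
import torch
import torch.nn as nn

class JointAttentionSketch(nn.Module):
    """Illustrative MMDiT-style block: per-modality projections feed one
    attention pass over the concatenated text+image token sequence."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.txt_qkv = nn.Linear(dim, dim * 3)  # text stream projections
        self.img_qkv = nn.Linear(dim, dim * 3)  # image-latent stream projections
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # txt: (B, T_txt, dim) prompt tokens; img: (B, T_img, dim) latent patches
        tq, tk, tv = self.txt_qkv(txt).chunk(3, dim=-1)
        iq, ik, iv = self.img_qkv(img).chunk(3, dim=-1)
        # Attention runs over the concatenation, so image patches attend
        # directly to character-level text tokens (and vice versa).
        q, k, v = (torch.cat(p, dim=1) for p in ((tq, iq), (tk, ik), (tv, iv)))
        out, _ = self.attn(q, k, v)
        return out[:, : txt.shape[1]], out[:, txt.shape[1] :]  # split per modality

txt = torch.randn(1, 8, 64)    # 8 prompt tokens
img = torch.randn(1, 256, 64)  # 16x16 grid of latent patches
txt_out, img_out = JointAttentionSketch()(txt, img)
```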
Context length is not specified in the official documentation, but real-world usage suggests the model handles complex multi-sentence prompts effectively, particularly for infographic-style generations. The model operates as a diffusion transformer, meaning generation proceeds through iterative denoising steps rather than autoregressive token prediction. This has implications for inference speed: instead of paying per-token sequential latency, you run a fixed number of compute-heavy denoising steps, each of which parallelizes well on GPU hardware.
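The step structure looks like the toy Euler-style sampler below; real schedulers use tuned timestep schedules and guidance, but the shape of the computation, a fixed number of full forward passes over noise, is the same.

```python
import torch

def denoise_sketch(model, prompt_emb, steps=50, shape=(1, 16, 128, 128)):
    """Toy Euler sampler: walk from pure noise toward the image latent in
    `steps` uniform increments. Shows the structure, not the real schedule."""
    latent = torch.randn(shape)                      # pure noise at t = 1
    ts = torch.linspace(1.0, 0.0, steps + 1)
    for t_now, t_next in zip(ts[:-1], ts[1:]):
        velocity = model(latent, t_now, prompt_emb)  # one full 20B forward pass
        latent = latent + (t_next - t_now) * velocity
    return latent                                    # VAE-decode to get pixels

dummy = lambda latent, t, cond: -latent  # stand-in for the 20B MMDiT
out = denoise_sketch(dummy, prompt_emb=None)
# 50 steps => 50 full forward passes, each highly parallel across the latent grid.
```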
Qwen-Image’s primary strength is text rendering within generated images. This is not a trivial capability — most image generation models struggle to produce legible, correctly spelled text, especially across multiple lines or in complex layouts. Qwen-Image handles multi-line English and Chinese text, including paragraph-level semantics, calligraphy effects, and text embedded in structured compositions like storefront signs, book covers, and presentation slides.
The model also supports consistent image editing, meaning you can provide an input image and a text instruction to modify it while preserving the original’s semantic content. This covers operations like object replacement, background changes, style transfer, and text rewriting in images. The editing pipeline is built on the same foundation model, so there is no separate architecture to manage.
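If you consume the editing variant through diffusers, the call shape looks roughly like the sketch below. The Qwen/Qwen-Image-Edit hub id and the image/prompt keyword arguments follow the usual diffusers instruction-editing convention and are assumptions here; check the model card for the exact interface.

```python
import torch
from diffusers import DiffusionPipeline
from PIL import Image

# Assumed hub id for the editing variant of the same foundation model.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
)
pipe.to("cuda")

source = Image.open("storefront.png")  # illustrative input path
edited = pipe(
    image=source,
    prompt="Replace the sign text with 'GRAND OPENING'; keep everything else unchanged",
    num_inference_steps=50,
).images[0]
edited.save("storefront_edited.png")
```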
Concrete use cases for practitioners:
- Posters, infographics, signage, and UI mockups with legible multi-line text baked into the image
- Book covers, storefront signs, and presentation slides that embed text in structured compositions
- Bilingual assets that require accurate English and Chinese rendering, including calligraphy effects
- Editing pipelines: object replacement, background changes, style transfer, and rewriting text inside existing images
The model is not designed for tasks like video generation, 3D asset creation, or audio processing. It is an image generation and editing model, period.
Running a dense 20B model locally is feasible on consumer hardware, but you need to plan around VRAM constraints. The model weights in FP16 occupy approximately 40 GB of VRAM. This puts full-precision inference out of reach for most single consumer GPUs, but quantization brings it into practical territory.
Minimum VRAM requirements by quantization (weights only; the arithmetic check appears after this list):
- FP16 (16 bits/param): ~40 GB
- Q8_0 (~8.5 bits/param): ~21 GB
- Q4_K_M (~4.85 bits/param): ~12 GB
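The numbers above are straight bits-per-parameter arithmetic, using the approximate effective bit widths of the GGUF formats; real files run slightly larger due to metadata:

```python
PARAMS = 20e9  # dense: every parameter is resident and used on every step

def weight_gb(bits_per_param: float) -> float:
    """Weight-only VRAM estimate in GB; excludes activations and runtime overhead."""
    return PARAMS * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    print(f"{name:7s} ~{weight_gb(bits):.1f} GB")
# FP16    ~40.0 GB
# Q8_0    ~21.2 GB
# Q4_K_M  ~12.1 GB
```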
Recommended configuration for most users: Q4_K_M quantization. This strikes the best balance between output quality and hardware accessibility. At Q4_K_M, you can run Qwen-Image on a single RTX 3090 or 4090 with enough remaining VRAM for prompt processing and intermediate tensors. Quality degradation at Q4_K_M is minimal for most generation tasks, though text rendering accuracy may drop slightly at very low bit widths.
Expected performance: On an RTX 4090 at Q4_K_M, expect roughly 1-3 iterations per second for a standard 1024x1024 generation with 50 denoising steps, which works out to roughly 17-50 seconds per image. On an M4 Max with unified memory, expect similar or slightly slower throughput depending on memory bandwidth. The model benefits significantly from GPU compute; CPU-only inference is not practical for interactive use.
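The wall-clock figure is just step count divided by iteration rate, which makes the latency/quality trade-off easy to reason about:

```python
def seconds_per_image(steps: int, iters_per_second: float) -> float:
    """Wall-clock estimate for one generation: denoising steps / iteration rate."""
    return steps / iters_per_second

print(seconds_per_image(50, 3.0))  # ~16.7 s at the fast end of 1-3 it/s
print(seconds_per_image(50, 1.0))  # 50.0 s at the slow end
print(seconds_per_image(20, 2.0))  # 10.0 s: fewer steps trade quality for latency
```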
Quickest way to get started: ComfyUI ships native support for Qwen-Image, and the Hugging Face diffusers library loads the official weights directly. For more control over quantization levels or inference parameters, use diffusers with the original model weights and apply quantization via bitsandbytes or torchao.
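A minimal text-to-image sketch with diffusers, assuming the generic DiffusionPipeline loader and the Qwen/Qwen-Image hub id; at bf16 this needs the full ~40 GB, so treat it as the reference path and swap in a quantized checkpoint for smaller cards:

```python
import torch
from diffusers import DiffusionPipeline

# Loads the full bf16 weights (~40 GB); verify the hub id and supported
# kwargs against the model card before relying on this exact interface.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
)
pipe.to("cuda")
# On smaller cards, pipe.enable_model_cpu_offload() streams components
# through system RAM at a throughput cost.

image = pipe(
    prompt=(
        'A bookstore window poster that reads "GRAND OPENING SALE" in bold '
        "serif type, with a Chinese translation beneath it"
    ),
    width=1024,
    height=1024,
    num_inference_steps=50,
).images[0]
image.save("poster.png")
```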
Qwen-Image vs FLUX.1-dev (12B): FLUX.1-dev is smaller and faster, with lower VRAM requirements. It produces excellent photographic realism and handles complex scenes well. Qwen-Image pulls ahead specifically on text rendering — if your use case requires legible multi-line text in any language, Qwen-Image is the better choice. FLUX is faster and more memory-efficient for general image generation without text requirements.
Qwen-Image vs SD3.5 (8B): SD3.5 is significantly smaller and runs on more modest hardware. It is a solid general-purpose model but does not match Qwen-Image on text rendering or editing consistency. If you are constrained to 12 GB VRAM or less, SD3.5 is the practical choice. If you have the hardware headroom and need text fidelity, Qwen-Image justifies the larger footprint.
When to choose Qwen-Image: Your workflow involves generating images with embedded text, building editing pipelines that require semantic consistency, or working with Chinese text rendering. Your hardware can accommodate at least 12 GB VRAM with quantization.
