
Highly efficient 6B image generation model using a Scalable Single-Stream DiT (S3-DiT). Decoupled-DMD distillation achieves sub-second inference in 8 NFEs on 16GB VRAM.
A solid 6B-parameter dense image generator from Alibaba, best evaluated on its target modality: fast, photorealistic text-to-image generation.
Z-Image-Turbo is a 6-billion-parameter text-to-image generation model from Alibaba's Tongyi MAI team, released in late 2025 under the Apache 2.0 license. It is the distilled variant of the larger Z-Image foundation model, optimized for speed and efficiency rather than maximum quality or flexibility. The model uses a Scalable Single-Stream DiT (S3-DiT) architecture and Decoupled-DMD distillation to achieve sub-second inference in just 8 function evaluations (NFEs).
What distinguishes Z-Image-Turbo from other open-weight image generation models is its ability to run on consumer-grade hardware with 16GB VRAM while producing photorealistic outputs in 1-3 seconds. It competes directly with models like FLUX.1-schnell and SD3.5-Turbo in the fast-inference text-to-image space, but its 6B parameter density and distillation approach give it a distinct speed advantage at comparable image quality. The model excels at photorealistic generation, bilingual text rendering (English and Chinese), and following complex prompts with high fidelity.
Z-Image-Turbo is a dense 6B parameter model built on the S3-DiT architecture. Unlike Mixture-of-Experts (MoE) models that activate only a subset of parameters per forward pass, this is a fully dense transformer—all 6B parameters are active during inference. This means VRAM consumption scales linearly with parameter count, but the tradeoff is consistent, predictable performance without routing overhead.
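The linear VRAM scaling of a dense model is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch (the helper name and the 2-bytes-per-parameter FP16 assumption are illustrative, and the figure covers weights only, not activations or the text encoder):

```python
def fp16_weight_footprint_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Rough VRAM needed just to hold dense model weights at FP16.

    Excludes activations, the text encoder, and framework overhead.
    """
    return n_params * bytes_per_param / 1024**3

# All 6B parameters are resident on every forward pass, so the weights alone
# occupy roughly 11.2 GiB at FP16, leaving headroom on a 16 GB card.
weights_gb = fp16_weight_footprint_gb(6e9)
print(f"{weights_gb:.1f} GiB")
```

This is why the 16GB figure quoted throughout the document is tight but workable at full precision.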
The key architectural innovation is Decoupled-DMD distillation, which compresses the inference process from the 28-50 steps required by the base Z-Image model down to just 8 steps. This is achieved by decoupling the distillation of the diffusion trajectory from the classifier-free guidance (CFG) mechanism. As a result, Z-Image-Turbo does not support CFG scaling or negative prompting—these features were sacrificed to achieve the 8-step inference target. The model relies on reinforcement learning during training to improve instruction adherence instead.
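The compute saving from distillation can be quantified, assuming the base model runs standard CFG (two forward passes per step, conditional plus unconditional) while the distilled model folds guidance into a single pass:

```python
# Back-of-envelope: transformer forward passes per image, before vs. after
# distillation. The base Z-Image model needs 28-50 denoising steps; with
# standard CFG each step costs two forward passes. Decoupled-DMD folds CFG
# into the distilled weights, so Turbo runs 8 steps at one pass each = 8 NFEs.
base_steps = (28, 50)
base_nfes = tuple(s * 2 for s in base_steps)  # CFG doubles the work per step
turbo_nfes = 8

speedup = tuple(n / turbo_nfes for n in base_nfes)
print(f"NFE reduction: {base_nfes[0]}-{base_nfes[1]} -> {turbo_nfes} "
      f"(~{speedup[0]:.0f}x-{speedup[1]:.1f}x fewer forward passes)")
```

A roughly 7x-12.5x reduction in forward passes is what turns multi-second base-model inference into the 8-NFE regime, at the cost of CFG and negative prompting.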
The single-stream design means text and image features are processed through a unified transformer backbone rather than separate encoders. This reduces computational overhead and memory bandwidth requirements, which is why the model fits in 16GB VRAM despite its parameter count. The architecture natively supports 1024x1024 image generation at the standard resolution.
Z-Image-Turbo is purpose-built for fast, photorealistic image generation from text prompts. Its standout capabilities are photorealistic output at its native 1024x1024 resolution, bilingual (English and Chinese) text rendering, and high-fidelity adherence to complex prompts. In practice this suits workflows built around rapid iteration, where generation latency matters more than stylistic range.
The model does not support negative prompting, and its output diversity is lower than the base Z-Image model's due to distillation. For applications requiring high stylistic variety or fine-grained control via negative prompts, the full Z-Image model is a better choice, but it requires 24GB of VRAM and 28-50 step inference.
Z-Image-Turbo is one of the most accessible 6B text-to-image models for local deployment. The model runs comfortably within 16GB of VRAM at full precision (FP16), which puts it in reach of mainstream 16GB consumer GPUs as well as higher-end cards.
For most users with a 16GB card, run at FP16. Only quantize if you are constrained by VRAM.
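That advice can be expressed as a small decision helper. This is a hypothetical sketch, not part of any official tooling: the dtype byte sizes are standard, but the 3 GB headroom figure for activations and the text encoder is my assumption.

```python
# Hypothetical helper mirroring the advice above: stay at FP16 unless the
# card cannot hold the ~12 GB of dense weights plus working headroom.
BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "nf4": 0.5}  # common precision levels

def pick_precision(vram_gb: float, n_params: float = 6e9,
                   headroom_gb: float = 3.0) -> str:
    """Return the highest precision whose weights + headroom fit in VRAM.

    headroom_gb is an assumed budget for activations and the text encoder.
    """
    for dtype in ("fp16", "int8", "nf4"):
        if n_params * BYTES_PER_PARAM[dtype] / 1e9 + headroom_gb <= vram_gb:
            return dtype
    raise ValueError("Not enough VRAM even for 4-bit weights")

print(pick_precision(16))  # 16 GB card: fp16 (12 GB weights + headroom fits)
print(pick_precision(12))  # 12 GB card: falls back to int8
```

The takeaway matches the prose: a 16GB card stays at FP16, and quantization is a fallback, not the default.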
Generation speed scales with hardware: an RTX 4090 at FP16 sits at the fast end of the 1-3 second range per 1024x1024 image, an RTX 4080 (16GB) is somewhat slower but still comfortably interactive, and an M4 Max (48GB) runs the model through unified memory with a further speed penalty relative to discrete NVIDIA GPUs.
The quickest path to running Z-Image-Turbo locally is through the Hugging Face Diffusers library or the official GitHub repository (Tongyi-MAI/Z-Image). The repository provides inference.py and batch_inference.py scripts for single and batch generation. The model is also available on Hugging Face at Tongyi-MAI/Z-Image-Turbo.
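A minimal inference sketch via Diffusers' generic loader. The pipeline entry point, the need for trust_remote_code, and the defaults below are assumptions based on the model card rather than a verified API; check the Tongyi-MAI/Z-Image repository's inference.py for the exact invocation.

```python
# Sketch of local inference through Diffusers' generic loader. The loading
# path below is an assumption, not a verified API; the defaults encode the
# constraints stated above (8 steps, native 1024x1024, no CFG).
TURBO_DEFAULTS = {
    "num_inference_steps": 8,  # Decoupled-DMD target; more steps buys little
    "height": 1024,            # native training resolution
    "width": 1024,
    # No guidance_scale / negative_prompt: Turbo does not support CFG.
}

def generate(prompt: str):
    # Heavy imports deferred so the module loads without a GPU present.
    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "Tongyi-MAI/Z-Image-Turbo",
        torch_dtype=torch.float16,  # fits in 16 GB VRAM at FP16
        trust_remote_code=True,     # assumption: custom pipeline in the repo
    ).to("cuda")
    return pipe(prompt, **TURBO_DEFAULTS).images[0]

if __name__ == "__main__":
    generate("a photorealistic red fox in fresh snow").save("fox.png")
```

Note what is absent from the call: no guidance scale and no negative prompt, reflecting the distillation tradeoff described earlier.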
For Apple Silicon users, the MLX community has produced working implementations that leverage the unified memory architecture effectively.
Z-Image-Turbo vs FLUX.1-schnell: Both target fast 4-8 step inference. FLUX.1-schnell (12B parameters) produces slightly higher aesthetic quality and supports negative prompting, but requires 24GB VRAM at FP16 and runs slower due to the larger architecture. Z-Image-Turbo wins on VRAM efficiency (16GB vs 24GB) and raw speed, and matches or exceeds FLUX.1-schnell on photorealism and text rendering. Choose FLUX.1-schnell if you need negative prompting and have 24GB VRAM. Choose Z-Image-Turbo if you are constrained to 16GB or need the fastest possible inference.
Z-Image-Turbo vs SD3.5-Turbo: SD3.5-Turbo (8B parameters, MMDiT architecture) also targets fast inference but requires approximately 20GB VRAM at FP16 and uses 4-6 steps. Z-Image-Turbo is more VRAM-efficient and generates consistently better bilingual text. SD3.5-Turbo offers more stylistic diversity and supports CFG. If you need diverse artistic styles, SD3.5-Turbo is the better pick. If you need photorealistic outputs, fast iteration, or bilingual text, Z-Image-Turbo is the stronger choice.
Bottom line: Z-Image-Turbo is the best option for local deployment if you have 16GB VRAM, need sub-2-second generation, and prioritize photorealism and text rendering over stylistic diversity. It is not a general-purpose image generation model—it is a specialized fast inference tool, and it excels at that specific job.
