8.1B Multimodal Diffusion Transformer (MMDiT) with QK-normalization for stability. Three fixed text encoders (OpenCLIP-ViT/G, CLIP-ViT/L, T5-xxl) with 256-token context.
Stable Diffusion 3.5 Large is Stability AI's flagship text-to-image model, released in October 2024 as the primary variant in the SD 3.5 family. At 8.1 billion parameters, it represents a significant architectural departure from the latent diffusion models that defined earlier Stable Diffusion versions, adopting a Multimodal Diffusion Transformer (MMDiT) design. It succeeds Stable Diffusion 3 Medium, whose release fell short of community expectations; Stability AI took additional development time to deliver a more capable follow-up.
The model sits in the upper-mid tier of open-weight image generation models, competing with offerings like FLUX.1 (12B parameters) and older SDXL-based checkpoints. Its dense 8.1B architecture means all parameters are active during inference — unlike mixture-of-experts models that activate only a subset per forward pass. This has direct implications for VRAM requirements and generation speed that practitioners need to account for when planning local deployments.
Stable Diffusion 3.5 Large is released under the Stability AI Community License, which permits free commercial use for individuals and organizations earning under $1 million annually. For higher-revenue commercial use, access runs through the Stability API or third-party providers. The model weights are publicly downloadable from Hugging Face.
Stable Diffusion 3.5 Large uses a Multimodal Diffusion Transformer (MMDiT) architecture. Unlike the U-Net backbone of SD 1.5 and SDXL, MMDiT processes text and image representations jointly through transformer blocks, enabling better cross-modal understanding during the denoising process. The model incorporates QK-normalization for training stability, a technique that normalizes query and key vectors in the attention mechanism to prevent attention logit growth during training.
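The effect of QK-normalization can be seen in a minimal sketch. The snippet below applies RMS normalization to queries and keys before the dot product (a common formulation of QK-norm; the exact variant inside SD 3.5's blocks may differ in details such as learned scales), which bounds the attention logits regardless of how large the incoming activations grow:

```python
import numpy as np

def qk_norm_attention(q, k, v, eps=1e-6):
    """Scaled dot-product attention with QK-normalization.

    RMS-normalizing queries and keys bounds the attention logits,
    preventing the logit growth that can destabilize training at scale.
    """
    # RMS-normalize along the head dimension
    q = q / np.sqrt(np.mean(q**2, axis=-1, keepdims=True) + eps)
    k = k / np.sqrt(np.mean(k**2, axis=-1, keepdims=True) + eps)

    d = q.shape[-1]
    # After normalization |q|,|k| = sqrt(d), so |logits| <= sqrt(d)
    logits = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Even with large-magnitude inputs, the output stays well-behaved.
rng = np.random.default_rng(0)
q = rng.normal(scale=100.0, size=(4, 64))  # deliberately huge activations
k = rng.normal(scale=100.0, size=(4, 64))
v = rng.normal(size=(4, 64))
out = qk_norm_attention(q, k, v)
print(out.shape)  # (4, 64)
```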
The model uses three fixed text encoders for prompt understanding: OpenCLIP-ViT/G, CLIP-ViT/L, and T5-xxl.
These encoders operate with a combined 256-token context window. This is a practical constraint — prompts longer than 256 tokens will be truncated. For complex compositions with detailed descriptions, practitioners should optimize prompts to fit within this limit.
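Because truncation is silent, it is worth checking prompt length before generation. A tokenizer-agnostic sketch of the constraint (the 256-token figure is from the model card; `token_ids` stands in for whatever your tokenizer produces):

```python
def fit_prompt(token_ids, max_tokens=256):
    """Truncate a tokenized prompt to the model's context window.

    Anything past the limit is dropped, so details you care about
    should appear early in the prompt.
    Returns the (possibly truncated) ids and a truncation flag.
    """
    if len(token_ids) <= max_tokens:
        return token_ids, False
    return token_ids[:max_tokens], True

ids = list(range(300))       # stand-in for a long tokenized prompt
kept, truncated = fit_prompt(ids)
print(len(kept), truncated)  # 256 True
```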
The MMDiT architecture means the model processes text conditioning and image generation jointly through shared transformer blocks. This differs from cross-attention approaches where text features are injected into a separate denoising backbone. The joint processing enables better alignment between prompt semantics and generated image content, which is why SD 3.5 Large shows improved typography and complex prompt understanding compared to earlier versions.
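The joint-processing idea can be illustrated with a stripped-down single attention step: text and image tokens are concatenated into one sequence and attend over each other symmetrically, rather than text being injected one way via cross-attention. This is a conceptual sketch only; the real blocks add projections, multiple heads, modality-specific weights, and QK-norm:

```python
import numpy as np

def joint_attention_step(text_tokens, image_tokens):
    """One greatly simplified MMDiT-style joint attention step.

    Both modalities are concatenated into a single sequence, so every
    image token can attend to every text token and vice versa.
    """
    x = np.concatenate([text_tokens, image_tokens], axis=0)
    d = x.shape[-1]
    logits = x @ x.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    y = w @ x
    # Split back into the two modalities after joint processing
    n_text = text_tokens.shape[0]
    return y[:n_text], y[n_text:]

rng = np.random.default_rng(1)
text = rng.normal(size=(77, 32))    # e.g. encoded prompt tokens
image = rng.normal(size=(256, 32))  # e.g. latent patch tokens
t_out, i_out = joint_attention_step(text, image)
print(t_out.shape, i_out.shape)     # (77, 32) (256, 32)
```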
Stable Diffusion 3.5 Large excels at text-to-image generation, with stronger prompt adherence and higher image quality than previous SD versions; typography and complex multi-element prompts benefit most from the MMDiT architecture.
The model does not lead in out-of-the-box photorealism; FLUX and Midjourney currently set the bar in that category. Text rendering, while improved, still produces artifacts in complex layouts, and hand generation remains a challenge, consistent with most diffusion models.
Stable Diffusion 3.5 Large is a demanding model for local hardware due to its 8.1B dense parameter count. All 8.1B parameters are active during inference, so VRAM consumption scales linearly with precision.
| Precision | Minimum VRAM | Recommended VRAM |
|-----------|-------------|------------------|
| FP16 (full) | 20 GB | 24 GB |
| FP8 | 12 GB | 16 GB |
| INT4 (Q4_K_M) | 8 GB | 12 GB |
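The table's minimums can be sanity-checked from the dense parameter count: weight storage alone is parameters times bytes per parameter, and activations, the text encoders, and the VAE account for the gap up to the listed figures. A minimal sketch:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Rough memory footprint of the model weights alone, in GB.

    Activations, text encoders, and the VAE add several GB on top,
    which is why practical minimums exceed these figures.
    """
    return n_params * bits_per_param / 8 / 1e9

N = 8.1e9  # dense parameters, all active at inference
for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(N, bits):.1f} GB weights")
# Roughly 16.2, 8.1, and 4.05 GB of weights respectively
```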
FP16 requires an RTX 4090 (24 GB) or equivalent. Apple Silicon users need M4 Max or M2/3 Ultra with at least 64 GB unified memory for reasonable performance.
FP8 is the sweet spot for RTX 3090/4090 owners, offering near-lossless quality with ~12 GB VRAM usage. This allows generation alongside other applications.
INT4 quantization (Q4_K_M recommended) brings the model within reach of RTX 3080 (10-12 GB) and RTX 4070 (12 GB) cards. Quality loss is minimal for most prompts, though fine details and text rendering may suffer slightly.
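The core idea behind 4-bit quantization can be sketched with a toy block-wise scheme (Q4_K_M itself uses a more elaborate hierarchical-scale format; this is an illustration of the principle, not the actual GGUF layout):

```python
import numpy as np

def quantize_4bit(weights, block_size=32):
    """Toy symmetric 4-bit block quantization.

    Each block of 32 weights shares one float scale; values are
    rounded to 4-bit integers in [-8, 7].
    """
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero in all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    return (q * scale).reshape(shape)

rng = np.random.default_rng(2)
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The per-block scale is why quality holds up: outliers only distort the 32 weights in their own block rather than the whole tensor.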
Generation speed depends heavily on hardware and precision: an RTX 4090 running FP8 is the fastest consumer configuration, an RTX 3090 with Q4_K_M trades speed for its lower VRAM footprint, and an Apple M4 Max (64 GB) runs the model comfortably but more slowly than discrete NVIDIA GPUs.
The fastest path to local inference is through ComfyUI or Automatic1111 WebUI, both of which support SD 3.5 Large natively. For command-line usage, the official Stability AI inference code is available on GitHub. NVIDIA NIM and TensorRT optimizations can improve inference speed by 20-40% on compatible GPUs.
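Beyond GUI front-ends, a minimal programmatic path is Hugging Face diffusers, whose StableDiffusion3Pipeline also loads the 3.5 checkpoints. A sketch, assuming you have accepted the model license on Hugging Face and authenticated locally (the prompt and output filename are placeholders):

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Downloads ~16 GB of weights on first run; needs a CUDA GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.bfloat16,    # half-precision weights; see table above
)
pipe.enable_model_cpu_offload()    # trades speed for lower peak VRAM

image = pipe(
    "a lighthouse on a cliff at dusk, oil painting",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("lighthouse.png")
```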
For users wanting to experiment with quantization, stable-diffusion.cpp and the ComfyUI-GGUF extension support SD 3.5 Large with various GGUF quantization levels (Q4_K_M among them), though the ecosystem is less mature than the llama.cpp ecosystem for LLMs.
FLUX.1 by Black Forest Labs is the primary competitor at a higher parameter count. FLUX.1 produces superior photorealism and text rendering out of the box, with better handling of hands and complex compositions. However, FLUX.1 requires 24 GB VRAM even at FP8, making it inaccessible to most consumer GPUs. SD 3.5 Large's advantage is accessibility — it runs on more hardware configurations, has a larger fine-tuning ecosystem on Civitai, and benefits from years of community tooling built around Stable Diffusion. Choose SD 3.5 Large if you need broad hardware compatibility and community support. Choose FLUX.1 if photorealism is critical and you have the hardware.
SDXL remains the most widely deployed Stable Diffusion model due to its lower hardware requirements (6-8 GB VRAM). SD 3.5 Large offers noticeably better prompt adherence, typography, and image coherence, particularly for complex prompts. However, SDXL has a vastly larger ecosystem of fine-tuned checkpoints, LoRAs, and ControlNet models. For users on RTX 3060-class hardware or older, SDXL remains the practical choice. SD 3.5 Large is worth the upgrade if you have 12+ GB VRAM and prioritize prompt fidelity over ecosystem breadth.