8.1B Multimodal Diffusion Transformer (MMDiT) with QK-normalization for stability. Three fixed text encoders (OpenCLIP-ViT/G, CLIP-ViT/L, T5-xxl) with 256-token context.
Stable Diffusion 3.5 Large is Stability AI's flagship text-to-image model, released in October 2024 as the primary variant in the SD 3.5 family. At 8.1 billion parameters, it represents a significant architectural departure from the latent diffusion models that defined earlier Stable Diffusion versions, adopting a Multimodal Diffusion Transformer (MMDiT) design. It succeeds Stable Diffusion 3 Medium, whose release fell short of community expectations; Stability AI took additional development time to deliver a more capable follow-up.
The model sits in the upper-mid tier of open-weight image generation models, competing with offerings like FLUX.1 (12B parameters) and older SDXL-based checkpoints. Its dense 8.1B architecture means all parameters are active during inference — unlike mixture-of-experts models that activate only a subset per forward pass. This has direct implications for VRAM requirements and generation speed that practitioners need to account for when planning local deployments.
Stable Diffusion 3.5 Large is released under the Stability AI Community License, which permits free commercial use for individuals and organizations earning under $1 million annually. For higher-revenue commercial use, access runs through the Stability API or third-party providers. The model weights are publicly downloadable from Hugging Face.
Stable Diffusion 3.5 Large uses a Multimodal Diffusion Transformer (MMDiT) architecture. Unlike the U-Net backbone of SD 1.5 and SDXL, MMDiT processes text and image representations jointly through transformer blocks, enabling better cross-modal understanding during the denoising process. The model incorporates QK-normalization for training stability, a technique that normalizes query and key vectors in the attention mechanism to prevent attention logit growth during training.
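The effect of QK-normalization can be seen in a minimal sketch. The snippet below applies RMS normalization to queries and keys before the dot product (a common formulation of QK-norm; the exact variant inside SD 3.5's blocks may differ in details such as learned scales), which bounds the attention logits regardless of how large the incoming activations grow:

```python
import numpy as np

def qk_norm_attention(q, k, v, eps=1e-6):
    """Scaled dot-product attention with QK-normalization.

    RMS-normalizing queries and keys bounds the attention logits,
    preventing the logit growth that can destabilize training at scale.
    """
    # RMS-normalize along the head dimension
    q = q / np.sqrt(np.mean(q**2, axis=-1, keepdims=True) + eps)
    k = k / np.sqrt(np.mean(k**2, axis=-1, keepdims=True) + eps)

    d = q.shape[-1]
    # After normalization |q|,|k| = sqrt(d), so |logits| <= sqrt(d)
    logits = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Even with large-magnitude inputs, the output stays well-behaved.
rng = np.random.default_rng(0)
q = rng.normal(scale=100.0, size=(4, 64))  # deliberately huge activations
k = rng.normal(scale=100.0, size=(4, 64))
v = rng.normal(size=(4, 64))
out = qk_norm_attention(q, k, v)
print(out.shape)  # (4, 64)
```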
The model uses three fixed text encoders for prompt understanding: OpenCLIP-ViT/G, CLIP-ViT/L, and T5-xxl.
These encoders operate with a combined 256-token context window. This is a practical constraint — prompts longer than 256 tokens will be truncated. For complex compositions with detailed descriptions, practitioners should optimize prompts to fit within this limit.
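Because truncation is silent, it is worth checking prompt length before generation. A tokenizer-agnostic sketch of the constraint (the 256-token figure is from the model card; `token_ids` stands in for whatever your tokenizer produces):

```python
def fit_prompt(token_ids, max_tokens=256):
    """Truncate a tokenized prompt to the model's context window.

    Anything past the limit is dropped, so details you care about
    should appear early in the prompt.
    Returns the (possibly truncated) ids and a truncation flag.
    """
    if len(token_ids) <= max_tokens:
        return token_ids, False
    return token_ids[:max_tokens], True

ids = list(range(300))       # stand-in for a long tokenized prompt
kept, truncated = fit_prompt(ids)
print(len(kept), truncated)  # 256 True
```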
The MMDiT architecture means the model processes text conditioning and image generation jointly through shared transformer blocks. This differs from cross-attention approaches where text features are injected into a separate denoising backbone. The joint processing enables better alignment between prompt semantics and generated image content, which is why SD 3.5 Large shows improved typography and complex prompt understanding compared to earlier versions.
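The joint-processing idea can be illustrated with a stripped-down single attention step: text and image tokens are concatenated into one sequence and attend over each other symmetrically, rather than text being injected one way via cross-attention. This is a conceptual sketch only; the real blocks add projections, multiple heads, modality-specific weights, and QK-norm:

```python
import numpy as np

def joint_attention_step(text_tokens, image_tokens):
    """One greatly simplified MMDiT-style joint attention step.

    Both modalities are concatenated into a single sequence, so every
    image token can attend to every text token and vice versa.
    """
    x = np.concatenate([text_tokens, image_tokens], axis=0)
    d = x.shape[-1]
    logits = x @ x.T / np.sqrt(d)
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    y = w @ x
    # Split back into the two modalities after joint processing
    n_text = text_tokens.shape[0]
    return y[:n_text], y[n_text:]

rng = np.random.default_rng(1)
text = rng.normal(size=(77, 32))    # e.g. encoded prompt tokens
image = rng.normal(size=(256, 32))  # e.g. latent patch tokens
t_out, i_out = joint_attention_step(text, image)
print(t_out.shape, i_out.shape)     # (77, 32) (256, 32)
```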
Stable Diffusion 3.5 Large excels at text-to-image generation, with stronger prompt adherence and higher image quality than previous SD versions; typography and complex multi-element prompts benefit most from the MMDiT architecture.
The model does not lead in out-of-the-box photorealism; FLUX and Midjourney currently set the bar in that category. Text rendering, while improved, still produces artifacts in complex layouts, and hand generation remains a challenge, consistent with most diffusion models.
Stable Diffusion 3.5 Large is a demanding model for local hardware due to its 8.1B dense parameter count. All 8.1B parameters are active during inference, so VRAM consumption scales linearly with precision.
| Precision | Minimum VRAM | Recommended VRAM |
|-----------|-------------|------------------|
| FP16 (full) | 20 GB | 24 GB |
| FP8 | 12 GB | 16 GB |
| INT4 (Q4_K_M) | 8 GB | 12 GB |
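The table's minimums can be sanity-checked from the dense parameter count: weight storage alone is parameters times bytes per parameter, and activations, the text encoders, and the VAE account for the gap up to the listed figures. A minimal sketch:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Rough memory footprint of the model weights alone, in GB.

    Activations, text encoders, and the VAE add several GB on top,
    which is why practical minimums exceed these figures.
    """
    return n_params * bits_per_param / 8 / 1e9

N = 8.1e9  # dense parameters, all active at inference
for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_memory_gb(N, bits):.1f} GB weights")
# Roughly 16.2, 8.1, and 4.05 GB of weights respectively
```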
FP16 requires an RTX 4090 (24 GB) or equivalent. Apple Silicon users need M4 Max or M2/3 Ultra with at least 64 GB unified memory for reasonable performance.
FP8 is the sweet spot for RTX 3090/4090 owners, offering near-lossless quality with ~12 GB VRAM usage. This allows generation alongside other applications.
INT4 quantization (Q4_K_M recommended) brings the model within reach of RTX 3080 (10-12 GB) and RTX 4070 (12 GB) cards. Quality loss is minimal for most prompts, though fine details and text rendering may suffer slightly.
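The core idea behind 4-bit quantization can be sketched with a toy block-wise scheme (Q4_K_M itself uses a more elaborate hierarchical-scale format; this is an illustration of the principle, not the actual GGUF layout):

```python
import numpy as np

def quantize_4bit(weights, block_size=32):
    """Toy symmetric 4-bit block quantization.

    Each block of 32 weights shares one float scale; values are
    rounded to 4-bit integers in [-8, 7].
    """
    w = weights.reshape(-1, block_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale[scale == 0] = 1.0  # avoid division by zero in all-zero blocks
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    return (q * scale).reshape(shape)

rng = np.random.default_rng(2)
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s, w.shape)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.3f}")
```

The per-block scale is why quality holds up: outliers only distort the 32 weights in their own block rather than the whole tensor.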
Generation speed depends heavily on hardware and precision: an RTX 4090 running FP8 is the fastest consumer configuration, an RTX 3090 with Q4_K_M trades speed for its lower VRAM footprint, and an Apple M4 Max (64 GB) runs the model comfortably but more slowly than discrete NVIDIA GPUs.
The fastest path to local inference is through ComfyUI or Automatic1111 WebUI, both of which support SD 3.5 Large natively. For command-line usage, the official Stability AI inference code is available on GitHub. NVIDIA NIM and TensorRT optimizations can improve inference speed by 20-40% on compatible GPUs.
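Beyond GUI front-ends, a minimal programmatic path is Hugging Face diffusers, whose StableDiffusion3Pipeline also loads the 3.5 checkpoints. A sketch, assuming you have accepted the model license on Hugging Face and authenticated locally (the prompt and output filename are placeholders):

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Downloads ~16 GB of weights on first run; needs a CUDA GPU.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3.5-large",
    torch_dtype=torch.bfloat16,    # half-precision weights; see table above
)
pipe.enable_model_cpu_offload()    # trades speed for lower peak VRAM

image = pipe(
    "a lighthouse on a cliff at dusk, oil painting",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("lighthouse.png")
```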
For users wanting to experiment with quantization, stable-diffusion.cpp and the ComfyUI-GGUF extension support SD 3.5 Large with various GGUF quantization levels (Q4_K_M among them), though the ecosystem is less mature than the llama.cpp ecosystem for LLMs.
FLUX.1 by Black Forest Labs is the primary competitor at a higher parameter count. FLUX.1 produces superior photorealism and text rendering out of the box, with better handling of hands and complex compositions. However, FLUX.1 requires 24 GB VRAM even at FP8, making it inaccessible to most consumer GPUs. SD 3.5 Large's advantage is accessibility — it runs on more hardware configurations, has a larger fine-tuning ecosystem on Civitai, and benefits from years of community tooling built around Stable Diffusion. Choose SD 3.5 Large if you need broad hardware compatibility and community support. Choose FLUX.1 if photorealism is critical and you have the hardware.
SDXL remains the most widely deployed Stable Diffusion model due to its lower hardware requirements (6-8 GB VRAM). SD 3.5 Large offers noticeably better prompt adherence, typography, and image coherence, particularly for complex prompts. However, SDXL has a vastly larger ecosystem of fine-tuned checkpoints, LoRAs, and ControlNet models. For users on RTX 3060-class hardware or older, SDXL remains the practical choice. SD 3.5 Large is worth the upgrade if you have 12+ GB VRAM and prioritize prompt fidelity over ecosystem breadth.