Ideogram's first open-weight text-to-image model, released 2026-06-03. It is a 9.3B-parameter single-stream Diffusion Transformer with 34 layers, trained from scratch, that uses Qwen3-VL-8B-Instruct as its text encoder and generates at up to 2048 px per side. Ideogram reports a 0.97 X-Omni English OCR score, the strongest in-image text rendering among open-weight models at its size, alongside a 1062 designer-preference ELO that placed it first among open-weight models (second overall) on its launch leaderboard. Weights, inference code, and a prompting guide ship publicly under a non-commercial license, with a separate commercial license for production use.
A 9.3B-parameter dense image generator from Ideogram. Composite scoring across modalities is still maturing, so treat per-modality benchmarks as the leading indicator of fit. The model is newly released, so its production readiness is not yet established.
Cheapest current cloud rentals with at least 6 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA L4 | Vast.ai | Spot | 24 GB | $0.03 |
| NVIDIA L4 | Vast.ai | On-Demand | 24 GB | $0.04 |
| NVIDIA GeForce RTX 5060 Ti | Vast.ai | Spot | 16 GB | $0.09 |
| NVIDIA GeForce RTX 5060 Ti | Vast.ai | On-Demand | 16 GB | $0.10 |
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
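As a rough illustration of that comparison, the sketch below inflates an hourly rate by an assumed fraction of paid time lost to interruptions and restarts. The function name and the 20% loss figure are hypothetical; only the two NVIDIA L4 rates come from the table above.

```python
def effective_rate(hourly_rate: float, lost_fraction: float) -> float:
    """Cost per useful GPU-hour when a fraction of paid time is lost to restarts."""
    if not 0.0 <= lost_fraction < 1.0:
        raise ValueError("lost_fraction must be in [0, 1)")
    return hourly_rate / (1.0 - lost_fraction)


# L4 rates from the table; the 20% loss on spot is an assumed figure.
spot = effective_rate(0.03, lost_fraction=0.20)
on_demand = effective_rate(0.04, lost_fraction=0.0)

print(f"spot: ${spot:.4f} per useful hour")
print(f"on-demand: ${on_demand:.4f} per useful hour")
```

Under these assumptions spot stays cheaper. With these two rates, the advantage disappears once about a quarter of spot time is lost.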
Ideogram 4.0 is Ideogram's first open-weight text-to-image model, released June 3, 2026. It is a 9.3B-parameter foundation model trained from scratch — not a fine-tune or distillation of any existing architecture. The model targets practitioners who need precise design output, particularly in-image text rendering, structured layout control, and multilingual typography.
Ideogram reports a 0.97 X-Omni English OCR score, the strongest in-image text rendering among open-weight models at its size. On its launch leaderboard, a 1062 designer-preference ELO placed it first among open-weight models and second overall. The weights, inference code, and prompting guide ship under a non-commercial license, with a separate commercial license for production use.
Ideogram 4.0 uses a single-stream Diffusion Transformer (DiT) with 34 layers. Text and image tokens share the same projections at every layer, a design that has become common among recent open-weight image models. Two choices set Ideogram 4.0 apart.
First, the text encoder is Qwen3-VL-8B-Instruct, a vision-language model used in text-only mode. The DiT consumes hidden states from 13 of its intermediate layers concatenated along the feature dimension, rather than a single final hidden state or no external encoder at all. This gives the model richer semantic understanding of prompts, particularly for complex descriptions and multilingual text.
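The feature-dimension concatenation described above can be sketched in plain Python. This is a toy illustration, not Ideogram's code: the helper name `concat_features`, the hidden size of 4, and the 3-token prompt are all assumptions chosen to keep the shapes readable, and the source does not say which 13 layers are tapped.

```python
def concat_features(layer_states):
    """Concatenate per-layer hidden states along the feature dimension.

    layer_states: list of layers, each a list of token vectors (seq_len x hidden).
    Returns one list of token vectors (seq_len x (num_layers * hidden)).
    """
    seq_len = len(layer_states[0])
    if any(len(layer) != seq_len for layer in layer_states):
        raise ValueError("all layers must cover the same tokens")
    return [
        [value for layer in layer_states for value in layer[t]]
        for t in range(seq_len)
    ]


NUM_TAPPED_LAYERS = 13  # from the text
HIDDEN_SIZE = 4         # toy value; the real encoder width is much larger
SEQ_LEN = 3             # toy prompt length

# Fake hidden states: layer i fills its vectors with the value i.
states = [
    [[float(i)] * HIDDEN_SIZE for _ in range(SEQ_LEN)]
    for i in range(NUM_TAPPED_LAYERS)
]

conditioning = concat_features(states)
print(len(conditioning), len(conditioning[0]))
```

The token count is unchanged while each token's feature width grows by a factor of 13, so the DiT sees one wide vector per prompt token rather than 13 separate sequences.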
Second, the model was trained exclusively on structured JSON captions with per-element styling, optional bounding boxes, and color palettes. The reference inference pipeline parses every prompt as JSON and validates it against the schema before generation. This means you can specify exact layout positions, text content, and color schemes in a structured format rather than hoping the model interprets natural language correctly.
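To make the parse-then-validate step concrete, here is a minimal sketch using only the Python standard library. The field names (`description`, `palette`, `elements`, `bbox`, and so on), the normalized bounding-box convention, and the `validate_caption` helper are hypothetical stand-ins; the actual schema is defined by the official prompting guide and may differ.

```python
import json

# Hypothetical caption; these field names are illustrative, not Ideogram's schema.
prompt_text = """
{
  "description": "Minimalist concert poster",
  "palette": ["#0B132B", "#F5F5F5", "#FF6B35"],
  "elements": [
    {"type": "text", "content": "LIVE TONIGHT",
     "style": "bold sans-serif", "bbox": [0.1, 0.08, 0.9, 0.25]},
    {"type": "image", "content": "silhouette of a guitarist",
     "style": "flat vector"}
  ]
}
"""


def validate_caption(raw: str) -> dict:
    """Parse a JSON caption and apply a few structural checks."""
    caption = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(caption, dict):
        raise ValueError("caption must be a JSON object")
    elements = caption.get("elements")
    if not isinstance(elements, list) or not elements:
        raise ValueError("caption needs a non-empty 'elements' list")
    for element in elements:
        if "content" not in element:
            raise ValueError("every element needs 'content'")
        bbox = element.get("bbox")  # optional, as in the text
        if bbox is not None:
            if len(bbox) != 4 or not all(0.0 <= v <= 1.0 for v in bbox):
                raise ValueError("bbox must be four values in [0, 1]")
            x0, y0, x1, y1 = bbox
            if x0 >= x1 or y0 >= y1:
                raise ValueError("bbox must have positive width and height")
    return caption


caption = validate_caption(prompt_text)
print(len(caption["elements"]), "elements validated")
```

Rejecting a malformed prompt before generation is cheaper than discovering the problem in a rendered image, which is presumably why the reference pipeline validates first.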
The pipeline consists of four components:
The model generates at up to 2048 px per side. Two quantizations are available at launch: nf4 (CUDA, supported in Diffusers) and fp8 (all hardware, no Diffusers support yet).
Ideogram 4.0 excels in three areas that have historically been weak points for open-weight image models:
In-image text rendering. The 0.97 X-Omni OCR score is not a benchmark artifact — the model renders English, Chinese, Japanese, Korean, and other scripts legibly within generated images. This is useful for poster mockups, signage, packaging design, UI screenshots, and any workflow where text must be readable, not decorative.
Structured layout control. The JSON prompting interface lets you specify bounding boxes for individual elements, define color palettes, and control per-element styling. This is valuable for design iteration where you need consistent placement across multiple generations — product shots with specific text placement, magazine layouts, or multi-element compositions.
Multilingual support. Because the text encoder is a vision-language model with multilingual training, Ideogram 4.0 handles non-English prompts and generates text in multiple scripts more reliably than models that rely on CLIP or T5 encoders alone.
Concrete use cases:
Ideogram 4.0 is a 9.3B dense model, meaning all parameters are active during inference. This puts it in the same VRAM class as other 7B-13B image generation models, with the caveat that the full pipeline (encoder + DiT + VAE) increases total memory pressure.
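A back-of-the-envelope estimate helps put the VRAM class in context. The sketch below computes the memory needed for the DiT weights alone at different precisions. The bytes-per-parameter figures are the usual approximations (nf4 ignores quantization metadata), bf16 is included only as an unquantized baseline, and the text encoder, VAE, and activations are deliberately excluded, so real usage will be higher.

```python
PARAMS_DIT = 9.3e9  # DiT parameter count from the text

# Approximate storage cost per parameter; nf4 ignores quantization metadata.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "nf4": 0.5}


def weight_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: {weight_gib(PARAMS_DIT, nbytes):.1f} GiB")
```

Because the model is dense, these figures apply to every inference step; there is no sparse routing to reduce the active parameter count.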
Minimum hardware requirements:
Recommended hardware:
Expected performance (nf4 on RTX 4090):
Quantization recommendations:
The quickest way to start is via the official GitHub repository (ideogram-oss/ideogram4), which ships inference code and a prompting guide. Diffusers support for nf4 means you can integrate it into existing pipelines with minimal friction.
vs. FLUX.1-dev (12B): FLUX produces higher photorealism and aesthetic quality in natural scenes, but Ideogram 4.0 wins on text rendering and structured layout control by a wide margin. FLUX has no native JSON prompting or bounding-box support. If your work is photography-adjacent, FLUX is stronger. If you need design output with embedded text, Ideogram 4.0 is the better choice.
vs. SD3.5 (8B): SD3.5 has a larger ecosystem (LoRAs, ControlNets, community tools) and runs on more hardware configurations. Ideogram 4.0 produces higher-resolution output natively (2048 px vs. 1024 px) and has superior text rendering. SD3.5 is the safer bet for general-purpose generation; Ideogram 4.0 is specialized for design work.
The tradeoff is license and ecosystem maturity. Ideogram 4.0 uses a non-commercial license by default, with a separate commercial license for production use. It also lacks the community-trained LoRAs and fine-tunes that have accumulated around SD3.5 and FLUX. For practitioners who need design-specific capabilities and are willing to work within the licensing terms, Ideogram 4.0 is currently the strongest open-weight option in its class.