Google DeepMind's experimental text-diffusion model, built on the Gemma 4 26B-A4B Mixture-of-Experts architecture with 25.2B total parameters and 3.8B active. Instead of generating one token at a time, it denoises blocks of 256 tokens in parallel, reaching over 1,000 tokens per second on an H100 and up to 4x faster generation than standard Gemma 4. It accepts text, image, and video input, supports a 256K context window and a configurable thinking mode, and ships under Apache 2.0. It scores 77.6 on MMLU Pro, 73.2 on GPQA Diamond, 70.5 on MATH-Vision, and 69.1 on both LiveCodeBench v6 and AIME 2026 without tools.
A strong 25.2B-parameter MoE language model from Google. High composite score across our benchmark mix — worth shortlisting when raw quality matters more than VRAM budget. Newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 9.7 GB | Low |
| Q4_K_M (Recommended) | 10.5 GB | Good |
| Q5_K_M | 10.9 GB | Very Good |
| Q6_K | 11.3 GB | Excellent |
| Q8_0 | 12.3 GB | Near Perfect |
| FP16 | 15.9 GB | Full |
See which devices can run this model and at what quality level.
| Device | Rating | Speed | VRAM Required |
|---|---|---|---|
| | SS | 47.9 tok/s | 10.5 GB |
| | SS | 61.4 tok/s | 10.5 GB |
| | SS | 49.1 tok/s | 10.5 GB |
| | SS | 49.1 tok/s | 10.5 GB |
| Google Cloud TPU v5e (Google) | SS | 62.9 tok/s | 10.5 GB |
| Intel Arc A770 16GB (Intel) | SS | 43.0 tok/s | 10.5 GB |
| | SS | 73.7 tok/s | 10.5 GB |
| | SS | 51.6 tok/s | 10.5 GB |
| | SS | 56.5 tok/s | 10.5 GB |
| | SS | 68.8 tok/s | 10.5 GB |
| | SS | 73.7 tok/s | 10.5 GB |
| | SS | 39.3 tok/s | 10.5 GB |
| | SS | 73.7 tok/s | 10.5 GB |
| NVIDIA GeForce RTX 3090 (NVIDIA) | SS | 71.9 tok/s | 10.5 GB |
| | SS | 77.4 tok/s | 10.5 GB |
| | SS | 125.9 tok/s | 10.5 GB |
| | SS | 137.6 tok/s | 10.5 GB |
| Origin PC M-CLASS v2 (Origin PC) | SS | 137.6 tok/s | 10.5 GB |
| | SS | 34.4 tok/s | 10.5 GB |
| NVIDIA L40S (NVIDIA) | SS | 66.3 tok/s | 10.5 GB |
| | SS | 73.7 tok/s | 10.5 GB |
| Origin PC L-CLASS v2 (Origin PC) | SS | 73.7 tok/s | 10.5 GB |
| NVIDIA A100 SXM4 80GB (NVIDIA) | SS | 156.5 tok/s | 10.5 GB |
| NVIDIA H100 SXM5 80GB (NVIDIA) | SS | 257.2 tok/s | 10.5 GB |
| NVIDIA GeForce RTX 5070 (NVIDIA) | SS | 51.6 tok/s | 10.5 GB |
Energy cost on AMD Radeon RX 7700 XT (~33 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): DiffusionGemma 26B-A4B on AMD Radeon RX 7700 XT · ~33 tok/s · 245W | $0.246 |
| GPT-5.5 (OpenAI) · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic) · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.5 Flash (Google) · in $1.50 · out $9.00 | $3.75 |
| Grok 4.3 (xAI) · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
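The figures in the table above can be roughly reproduced with a few lines of arithmetic. The sketch below blends API prices at the stated 70% input / 30% output split and estimates local energy cost from power draw and throughput. The electricity price of $0.12/kWh is an assumption for illustration only, since the page does not state the rate it uses, and the function names are hypothetical.

```python
def blended_price(input_per_m: float, output_per_m: float, input_share: float = 0.70) -> float:
    """Blend per-1M-token input and output prices at the given input share."""
    return input_share * input_per_m + (1.0 - input_share) * output_per_m


def energy_cost_per_million(tok_per_s: float, watts: float, usd_per_kwh: float) -> float:
    """Energy-only cost of generating 1M tokens locally."""
    hours = 1_000_000 / tok_per_s / 3600   # wall-clock hours for 1M tokens
    kwh = hours * watts / 1000             # energy drawn over that time
    return kwh * usd_per_kwh


# API rows from the table (USD per 1M tokens: input, output).
gpt = blended_price(5.00, 30.00)
claude = blended_price(5.00, 25.00)
gemini = blended_price(1.50, 9.00)
grok = blended_price(1.25, 2.50)

# Local row: ~33 tok/s at 245 W. ASSUMPTION: $0.12/kWh electricity.
local = energy_cost_per_million(tok_per_s=33, watts=245, usd_per_kwh=0.12)
```

With the assumed rate, the local estimate lands at roughly $0.25 per 1M tokens, close to the $0.246 shown. The small difference is expected because the throughput is rounded and the actual electricity rate is unknown.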
Cheapest current cloud rentals with at least 10 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA L4Vast.ai · Spot · 24 GB VRAM | $0.03 |
NVIDIA L4Vast.ai · On-Demand · 24 GB VRAM | $0.04 |
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM | $0.09 |
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM | $0.10 |
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Google DeepMind’s DiffusionGemma 26B-A4B is the first open-weight text-diffusion model built for production inference. Rather than generating tokens one at a time in a left-to-right autoregressive chain, it denoises entire blocks of 256 tokens in parallel. This shifts the primary bottleneck from memory bandwidth to compute, delivering over 1,000 tokens per second on a single H100 and up to 4× faster generation than its Gemma 4 sibling at the same parameter count.
The model is multimodal, accepting text, image, and video input, and features a 256K-token context window. It ships under Apache 2.0, making it freely usable for research, commercial products, and local deployment. DiffusionGemma is not a drop-in replacement for every LLM workload: it trades a measurable quality regression (77.6 vs. 82.6 on MMLU Pro against Gemma 4 26B) for a roughly 4× improvement in generation speed. If your application needs real-time text output, code infilling, or interactive editing, this model is purpose-built for that use case.
DiffusionGemma 26B-A4B uses a Mixture-of-Experts (MoE) encoder-decoder architecture with 25.2B total parameters and only 3.8B active per forward pass. The model activates 8 out of 128 total experts (plus one shared expert) across 30 layers. This sparsity keeps memory footprint low while maintaining strong reasoning capacity—a key design choice for local execution.
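To make the sparsity concrete, here is a minimal sketch of top-k expert routing with one always-on shared expert, using the 8-of-128 figures quoted above. The router details (softmax over the selected logits, scalar toy experts, and the names `route` and `moe_output`) are assumptions for illustration. The source does not describe the model's actual gating function.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts (from the text)
TOP_K = 8          # experts activated per token (from the text)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts and renormalize their weights to sum to 1."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:top_k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_output(x, router_logits, experts, shared_expert):
    """Shared expert output plus the weighted sum of the selected routed experts."""
    out = shared_expert(x)
    for idx, w in route(router_logits):
        out += w * experts[idx](x)
    return out


# Toy usage: each "expert" is a scalar function that scales its input.
rng = random.Random(0)
experts = [lambda x, s=i / NUM_EXPERTS: s * x for i in range(NUM_EXPERTS)]
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)
y = moe_output(1.0, logits, experts, shared_expert=lambda x: x)
```

Only the 8 selected experts and the shared expert run for a given token, which is why the active parameter count (3.8B) is a small fraction of the total (25.2B).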
The generation process works in three stages:
This block-autoregressive approach bypasses the sequential decode bottleneck that limits traditional LLMs. Instead of moving model weights from memory to compute for each token, DiffusionGemma feeds the GPU a large parallel workload, keeping tensor cores saturated.
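The block-wise idea can be illustrated with a toy sketch: each block starts fully masked, is refined over a fixed number of parallel denoising steps, and is then appended to the context before the next block begins. Everything here is an assumption for illustration, including the confidence-based unmasking schedule, the step count, and the random `toy_denoiser` standing in for the model. It is a generic masked-diffusion-style loop, not DeepMind's published algorithm.

```python
import random

MASK = None        # placeholder for a position that has not been denoised yet
BLOCK_SIZE = 256   # tokens denoised in parallel per block (from the text)


def toy_denoiser(context, block, rng):
    """Hypothetical model stand-in: propose (token, confidence) for every masked slot."""
    return {i: (rng.randrange(1000), rng.random()) for i, t in enumerate(block) if t is MASK}


def generate_block(context, steps, rng, block_size=BLOCK_SIZE):
    block = [MASK] * block_size
    for step in range(steps):
        proposals = toy_denoiser(context, block, rng)  # one parallel pass over the block
        if not proposals:
            break
        # Commit the most confident proposals; the final step commits all that remain.
        quota = -(-len(proposals) // (steps - step))   # ceiling division
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)[:quota]
        for i in best:
            block[i] = proposals[i][0]
    return block


def generate(prompt, num_blocks, steps=8, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt)
    for _ in range(num_blocks):            # blocks are still produced left to right
        tokens.extend(generate_block(tokens, steps, rng))
    return tokens


out = generate(prompt=[1, 2, 3], num_blocks=2)
```

In this sketch, two blocks of 256 tokens take 16 model passes instead of the 512 that one-token-at-a-time decoding would need. That reduction is the source of the parallelism the paragraph above describes.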
The model supports a native system role for structured prompts, a sliding window of 1024 tokens, and a vocabulary size of 262K tokens. The vision encoder adds roughly 550M parameters for image/video processing.
DiffusionGemma excels in scenarios where generation speed matters more than maximum benchmark accuracy. Its capabilities span:
Best-fit use cases:
This model is designed to run on a single capable GPU, not a cluster. Here is what you need to know to run DiffusionGemma 26B-A4B locally.
The quickest way to run DiffusionGemma 26B-A4B locally is via Ollama. After installing Ollama, pull the model:
ollama pull diffusiongemma:26b
Then run:
ollama run diffusiongemma:26b
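Once the model is pulled, it can also be called from a script through Ollama's local HTTP API. The sketch below uses only the Python standard library and assumes Ollama is serving on its default port (11434) and that the `diffusiongemma:26b` tag from the commands above is available on your machine. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt, model="diffusiongemma:26b"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, model="diffusiongemma:26b", timeout=120):
    """Send the prompt and return the generated text (requires a running Ollama server)."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example call, assuming the server is running and the model is pulled:
# print(generate("Write a haiku about parallel decoding."))
```

Setting `"stream": False` returns the whole completion in one JSON object, which keeps the example short. For interactive use, streaming is usually the better fit.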
For more control, use the Hugging Face transformers library with the official Google repository. The model supports NVFP4 (NVIDIA’s 4-bit floating-point) on Blackwell GPUs for further acceleration.
| Metric | DiffusionGemma 26B | Gemma 4 26B |
|---|---|---|
| MMLU Pro | 77.6 | 82.6 |
| AIME 2026 (no tools) | 69.1 | ~79 |
| Generation speed (H100) | 1,008 t/s | ~250 t/s |
| Code infilling | Native | Not designed for this |
Choose DiffusionGemma when speed and bidirectional generation are critical. Choose Gemma 4 if you need maximum accuracy and can tolerate slower inference.
DiffusionGemma and DeepSeek V2 are both MoE models, but DeepSeek V2 has 236B total parameters and ~21B active. DeepSeek V2 scores higher on complex math and coding benchmarks but requires multiple GPUs for local deployment (80 GB+ VRAM). DiffusionGemma is more practical for single-GPU setups and is much faster on consumer hardware.
Llama 3.1 70B is a dense model that needs roughly 40 GB of VRAM at 4-bit quantization and about 70 GB at 8-bit, putting it out of reach of a single consumer GPU. DiffusionGemma’s MoE efficiency (3.8B active) offers a dramatic VRAM advantage while providing competitive speed for its parameter class.
Bottom line: DiffusionGemma 26B-A4B is the fastest open-weight model in the ~25B total-parameter class for local deployment. If your workload is latency-sensitive and you can tolerate benchmark gaps of 5–15 points vs. top dense models, this is the clear choice. For maximum accuracy on math, reasoning, or complex multilingual tasks, stick with standard autoregressive models at the same active parameter count.
