Google

DiffusionGemma 26B-A4B

Google DeepMind's experimental text-diffusion model, built on the Gemma 4 26B-A4B Mixture-of-Experts architecture with 25.2B total parameters and 3.8B active. Instead of generating one token at a time, it denoises blocks of 256 tokens in parallel, reaching over 1,000 tokens per second on an H100 and up to 4x faster generation than standard Gemma 4. It accepts text, image, and video input, supports a 256K context window and a configurable thinking mode, and ships under Apache 2.0. It scores 77.6 on MMLU Pro, 73.2 on GPQA Diamond, 70.5 on MATH-Vision, and 69.1 on both LiveCodeBench v6 and AIME 2026 without tools.

25.2B params · MoE · 256K ctx · Multimodal

View on Hugging Face Official Page

Our Take

Best for: Workstation-class chat and code with usable context

A strong 25.2B-parameter MoE language model from Google. High composite score across our benchmark mix — worth shortlisting when raw quality matters more than VRAM budget. Newly released, so production-readiness is still being shaken out.

Run this on: ACEMAGIC M1A Pro (i9-13900HK + ARC A770). Cheapest card in our directory with comfortable headroom (16 GB) for this model at Q4 (~10.5 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters: 25.2B

Active Params: 3.8B

Architecture: MoE

Context Length: 256K tokens

Modality: Multimodal

Training Cutoff: January 2025

Provider: Google

Download Size: 51.7 GB

Community

Monthly Downloads: 1.1M

Likes: 1.1K

Last Updated: 16 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

No benchmark data available for this model yet.

MBA Open Score

84.0 (AA)

Component	Weight	Score
Benchmark	40%	89.0
Popularity	25%	59.4
Efficiency	20%	94.4
Versatility	15%	97.5
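The four component scores and weights listed above appear to combine into the headline figure as a plain weighted sum. The page does not state its formula, so the sketch below is an assumption that happens to reproduce the published 84.0; the function name is illustrative.

```python
# Component scores and weights as listed on this page.
COMPONENTS = {
    "benchmark": (89.0, 0.40),
    "popularity": (59.4, 0.25),
    "efficiency": (94.4, 0.20),
    "versatility": (97.5, 0.15),
}


def composite_score(components):
    """Weighted sum of (score, weight) pairs; weights must sum to 1.

    Assumption: the directory uses a simple weighted sum. This is inferred
    from the numbers, not documented on the page.
    """
    total_weight = sum(weight for _, weight in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(score * weight for score, weight in components.values())


score = composite_score(COMPONENTS)
print(f"{score:.1f}")
```

The unrounded value is 83.955, which rounds to the 84.0 shown. Popularity is the only component pulling the total down.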

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	9.7 GB	Low	Aggressive quantization; smallest size, noticeable quality loss
Q4_K_M (Recommended)	10.5 GB	Good	Best balance of size and quality for most use cases
Q5_K_M	10.9 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	11.3 GB	Excellent	Near-lossless quality with manageable size
Q8_0	12.3 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	15.9 GB	Full	Full 16-bit floating point; maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Throughput	VRAM Required
AMD Radeon RX 7800 XT	AMD	SS	47.9 tok/s	10.5 GB
AMD Radeon RX 7900 XT	AMD	SS	61.4 tok/s	10.5 GB
AMD Radeon RX 9070	AMD	SS	49.1 tok/s	10.5 GB
AMD Radeon RX 9070 XT	AMD	SS	49.1 tok/s	10.5 GB
Google Cloud TPU v5e	Google	SS	62.9 tok/s	10.5 GB
Intel Arc A770 16GB	Intel	SS	43.0 tok/s	10.5 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)	NOVATECH	SS	73.7 tok/s	10.5 GB
NVIDIA GeForce RTX 4070 Ti SUPER	NVIDIA	SS	51.6 tok/s	10.5 GB
NVIDIA GeForce RTX 4080 SUPER	NVIDIA	SS	56.5 tok/s	10.5 GB
NVIDIA GeForce RTX 5070 Ti	NVIDIA	SS	68.8 tok/s	10.5 GB
NVIDIA GeForce RTX 5080 Founders Edition	NVIDIA	SS	73.7 tok/s	10.5 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	39.3 tok/s	10.5 GB
AMD Radeon RX 7900 XTX	AMD	SS	73.7 tok/s	10.5 GB
NVIDIA GeForce RTX 3090	NVIDIA	SS	71.9 tok/s	10.5 GB
NVIDIA GeForce RTX 4090 Founders Edition	NVIDIA	SS	77.4 tok/s	10.5 GB
Google Cloud TPU v6e (Trillium)	Google	SS	125.9 tok/s	10.5 GB
NVIDIA GeForce RTX 5090 Founders Edition	NVIDIA	SS	137.6 tok/s	10.5 GB
Origin PC M-CLASS v2	Origin PC	SS	137.6 tok/s	10.5 GB
NVIDIA GeForce RTX 5060 Ti 16GB	NVIDIA	SS	34.4 tok/s	10.5 GB
NVIDIA L40S	NVIDIA	SS	66.3 tok/s	10.5 GB
NVIDIA RTX 6000 Ada Generation	NVIDIA	SS	73.7 tok/s	10.5 GB
Origin PC L-CLASS v2	Origin PC	SS	73.7 tok/s	10.5 GB
NVIDIA A100 SXM4 80GB	NVIDIA	SS	156.5 tok/s	10.5 GB
NVIDIA H100 SXM5 80GB	NVIDIA	SS	257.2 tok/s	10.5 GB
NVIDIA GeForce RTX 5070	NVIDIA	SS	51.6 tok/s	10.5 GB


Run Locally vs API

Energy cost on AMD Radeon RX 7700 XT (~33 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	DiffusionGemma 26B-A4B on AMD Radeon RX 7700 XT · ~33 tok/s · 245 W	$0.246
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.
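The blended API figures follow directly from the 70/30 split, and the local figure from throughput and power draw. The sketch below reworks both. The electricity rate is a hypothetical value chosen for illustration, since the page does not state the rate it used.

```python
def blended_price(input_usd, output_usd, input_share=0.70):
    """Blend per-1M-token input and output prices by traffic share."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


def energy_kwh_per_million(tokens_per_second, watts):
    """Energy in kWh to generate 1M tokens at a steady rate and power draw."""
    hours = 1_000_000 / tokens_per_second / 3600
    return hours * watts / 1000


# (input, output) prices in USD per 1M tokens, copied from the table above.
API_PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7 Thinking": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
}
blended = {name: blended_price(i, o) for name, (i, o) in API_PRICES.items()}

# Local run: ~33 tok/s at 245 W, as listed for the RX 7700 XT.
kwh = energy_kwh_per_million(33, 245)

# Hypothetical electricity rate; not taken from the page.
ELECTRICITY_USD_PER_KWH = 0.12
local_cost = kwh * ELECTRICITY_USD_PER_KWH

for name, usd in blended.items():
    print(f"{name}: ${usd:.3f}")
print(f"local: {kwh:.2f} kWh, ${local_cost:.3f}")
```

The blend reproduces $12.50, $11.00, and $3.75 exactly, and 1.625 for Grok 4.3, which the table shows rounded up to $1.63. The local run works out to about 2.06 kWh per million tokens, so a rate near $0.12/kWh lands close to the listed $0.246.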

Hardware amortisation not included. Run the full ROI calculator for payback math.


Rent in the Cloud

Cheapest current cloud rentals with at least 10 GB VRAM, refreshed hourly.

GPU	Provider · Tier · VRAM	Cost / GPU-hour
NVIDIA L4	Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4	Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 Ti	Vast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 Ti	Vast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 Ti	Vast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
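To compare rentals with the per-token figures elsewhere on this page, an hourly rate can be converted to a cost per million tokens. The sketch below pairs two spot rates from the rental table with the throughput this directory lists for the same cards. It assumes the rented GPU sustains that throughput and is never idle, which real workloads rarely achieve.

```python
def rental_cost_per_million(usd_per_hour, tokens_per_second):
    """USD to generate 1M tokens on a rented GPU running flat out."""
    hours = 1_000_000 / tokens_per_second / 3600
    return hours * usd_per_hour


# (hourly spot rate, directory throughput in tok/s), both taken from this page.
OPTIONS = {
    "RTX 5060 Ti 16GB (spot)": (0.09, 34.4),
    "RTX 5070 Ti (spot)": (0.11, 68.8),
}
costs = {
    name: rental_cost_per_million(rate, tps)
    for name, (rate, tps) in OPTIONS.items()
}

for name, usd in costs.items():
    print(f"{name}: ${usd:.2f} per 1M tokens")
```

Under these assumptions the faster card is cheaper per token ($0.44 against $0.73) despite its higher hourly rate, because it finishes the same work in half the time.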


About This Model

Overview

Google DeepMind’s DiffusionGemma 26B-A4B is an open-weight text-diffusion model, billed as the first built for production inference, though it is still labeled experimental. Rather than generating tokens one at a time in a left-to-right autoregressive chain, it denoises entire blocks of 256 tokens in parallel. This shifts the primary bottleneck from memory bandwidth to compute, delivering over 1,000 tokens per second on a single H100 and up to 4× faster generation than its Gemma 4 sibling at the same parameter count.

The model is multimodal, accepting text, image, and video input, and has a 256K-token context window. It ships under Apache 2.0, so it can be used for research, commercial products, and local deployment. DiffusionGemma is not a drop-in replacement for every LLM workload: it trades a measured quality regression (5–19 benchmark points against Gemma 4 26B) for much faster generation. It is best suited to applications that need real-time text output, code infilling, or interactive editing.

Architecture & Technical Details

DiffusionGemma 26B-A4B uses a Mixture-of-Experts (MoE) encoder-decoder architecture with 25.2B total parameters, of which only 3.8B are active per forward pass. Each of the 30 layers activates 8 of its 128 experts, plus one shared expert. This sparsity keeps per-token compute low while preserving the capacity of the full parameter set. All 25.2B parameters must still be resident in memory, so the saving is in compute rather than in footprint.
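As an illustration of what activating 8 of 128 experts means for a single token, here is a minimal top-k routing sketch. It uses generic softmax gating over the selected logits; the model's actual router is not described on this page, so treat the gating rule and the random logits as assumptions.

```python
import math
import random

NUM_EXPERTS = 128  # routed experts per layer (from the text)
TOP_K = 8          # experts activated per token (from the text)


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k experts and return {expert_index: gate weight}.

    Generic top-k gating: softmax over the selected logits only.
    The shared expert is always on, so it does not take part in routing.
    """
    order = sorted(
        range(len(router_logits)),
        key=lambda i: router_logits[i],
        reverse=True,
    )
    chosen = order[:top_k]
    peak = max(router_logits[i] for i in chosen)  # subtract for stability
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    norm = sum(exps.values())
    return {i: value / norm for i, value in exps.items()}


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in scores
gates = route_token(logits)

# Share of parameters touched per forward pass: 3.8B active of 25.2B total.
active_fraction = 3.8 / 25.2
print(len(gates), f"{active_fraction:.1%}")
```

Only the chosen experts run for that token, which is why roughly 15% of the parameters do the work on any given forward pass.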

The generation process works in three stages:

Encoder prefill – The autoregressive encoder processes the prompt context and stores the result in a KV cache.
Canvas denoising – The decoder starts with a block of 256 random tokens and iteratively refines them over multiple steps using discrete diffusion, with bidirectional attention over the canvas. Every token can attend to every other token in the block, enabling self-correction and global coherence.
Canvas commit – Once denoised, the block is appended to the KV cache, and the encoder prepares for the next canvas.
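The three stages can be sketched as a toy loop. This is not Google's sampler: the denoiser below is a stand-in that already knows the answer, the step count and vocabulary size are made up, and a plain list plays the role of the KV cache. It only shows the control flow of prefill, iterative refinement of a 256-token canvas, and commit.

```python
import random

BLOCK_SIZE = 256   # canvas length (from the text)
NUM_STEPS = 8      # hypothetical; the text only says "multiple steps"
VOCAB_SIZE = 1000  # toy vocabulary, far smaller than the real one


def toy_denoiser(context, canvas, rng):
    """Stand-in for the decoder: one (token, confidence) pair per position.

    A real model would attend bidirectionally over the canvas and the cached
    context. Here the target is simply (context length + position) % vocab.
    """
    return [
        ((len(context) + pos) % VOCAB_SIZE, rng.random())
        for pos in range(len(canvas))
    ]


def denoise_block(context, rng, steps=NUM_STEPS):
    """Refine a random canvas, fixing the most confident positions each step."""
    canvas = [rng.randrange(VOCAB_SIZE) for _ in range(BLOCK_SIZE)]
    committed = [False] * BLOCK_SIZE
    per_step = -(-BLOCK_SIZE // steps)  # ceiling division
    for _ in range(steps):
        proposals = toy_denoiser(context, canvas, rng)
        open_positions = [p for p in range(BLOCK_SIZE) if not committed[p]]
        open_positions.sort(key=lambda p: proposals[p][1], reverse=True)
        for p in open_positions[:per_step]:
            canvas[p] = proposals[p][0]
            committed[p] = True
    return canvas


def generate(prompt_tokens, num_blocks, seed=0):
    rng = random.Random(seed)
    context = list(prompt_tokens)            # stage 1: prefill
    for _ in range(num_blocks):
        block = denoise_block(context, rng)  # stage 2: canvas denoising
        context.extend(block)                # stage 3: canvas commit
    return context


out = generate([1, 2, 3], num_blocks=2)
print(len(out))
```

Blocks are still produced one after another, so the scheme is autoregressive at block granularity and parallel only within each canvas.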

This block-autoregressive approach bypasses the sequential decode bottleneck that limits traditional LLMs. Instead of moving model weights from memory to compute for each token, DiffusionGemma feeds the GPU a large parallel workload, keeping tensor cores saturated.

The model supports a native system role for structured prompts, sliding-window attention with a 1,024-token window, and a 262K-token vocabulary. The vision encoder adds roughly 550M parameters for image and video processing.

Capabilities & Use Cases

DiffusionGemma excels in scenarios where generation speed matters more than maximum benchmark accuracy. Its capabilities span:

Chat & instruction-following – Natural dialogue and task completion with configurable thinking modes (reasoning enabled/disabled).
Code generation & infilling – Bidirectional attention during generation is a major advantage for code completion and inline editing. The model can evaluate the entire code block simultaneously, making it ideal for IDEs and real-time coding assistants.
Vision & multimodal reasoning – Accepts images and videos interleaved with text. Supports variable aspect ratios and resolutions. Scores 70.5 on MATH-Vision, indicating strong performance on visual math problems.
Multilingual text – Supports multiple languages (exact list not published, but built on Gemma 4’s multilingual backbone).
Function calling – Can integrate with external tools and APIs when prompted appropriately.
Math & reasoning – Scores 77.6 on MMLU Pro, 73.2 on GPQA Diamond, 69.1 on AIME 2026 (no tools), and 69.1 on LiveCodeBench v6. These numbers are 5–19 points below Gemma 4 26B on the same benchmarks, so choose this model only if speed is the priority.

Best-fit use cases:

Real-time chat applications with low latency requirements
Code editing tools (e.g., IDE autocomplete, code review comments)
Interactive document drafting with inline corrections
Any pipeline where you need fast streaming output and can tolerate moderate accuracy tradeoffs

Running DiffusionGemma 26B-A4B Locally

This model is designed to run on a single capable GPU, not a cluster. Here is what you need to know to run DiffusionGemma 26B-A4B locally.

VRAM Requirements

Full FP16 – Requires roughly 48 GB VRAM for the full 25.2B model (active + inactive experts). Not feasible on consumer GPUs.
Q4_K_M quantization – Fits comfortably in 18–20 GB VRAM. This is the recommended default for most users. With 3.8B active parameters, the model is highly efficient even when quantized.
Q4 (4-bit) – Can run on an RTX 4090 (24 GB) or RTX 5090 (32 GB) with room for KV cache at 256K context.
Q3 (3-bit) – Possible on 16 GB GPUs (e.g., RTX 4070 Ti) but expect quality degradation.
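A quick way to sanity-check VRAM figures like these is a weights-only estimate: total parameters times bits per weight. The sketch below uses nominal bit widths; real quantization formats store extra scale data, and the KV cache and activations come on top, so treat the results as lower bounds.

```python
TOTAL_PARAMS = 25.2e9  # total parameters, from the spec table


def weight_gb(params, bits_per_weight):
    """Weights-only footprint in decimal GB (no KV cache or activations)."""
    return params * bits_per_weight / 8 / 1e9


# Nominal bit widths only; actual quantized files are somewhat larger.
estimates = {
    label: weight_gb(TOTAL_PARAMS, bits)
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4), ("3-bit", 3)]
}

for label, gb in estimates.items():
    print(f"{label:>6}: {gb:.2f} GB")
```

At 16 bits the weights alone come to about 50 GB, in line with the 51.7 GB download size, and at 4 bits to about 12.6 GB. Every expert has to be loaded, so the 3.8B active-parameter figure does not reduce these numbers.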

Recommended Hardware

Best performance – Single NVIDIA H100 (80 GB). Achieves 1,000+ tokens per second.
Consumer flagship – RTX 5090 (32 GB) or RTX 4090 (24 GB) with Q4_K_M quantization. Expect 700+ tokens per second on the 5090 and roughly 500 on the 4090.
Apple Silicon – M4 Max with 36 GB+ unified memory can run Q4_K_M via MLX, but token throughput will be lower (~200–300 t/s). M4 Ultra (64+ GB) improves that.
Minimum viable – Any GPU with at least 16 GB VRAM (for example, a 16 GB RTX 40-series card, or an AMD RX 7900 XTX with ROCm). Hardware FP8 is available on RTX 40-series; FP4 requires Blackwell. Expect 3–5 t/s with aggressive quantization.

Getting Started

The quickest way to run DiffusionGemma 26B-A4B locally is via Ollama. After installing Ollama, pull the model:

ollama pull diffusiongemma:26b

Then run:

ollama run diffusiongemma:26b
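Once the model is pulled, it can also be called from code through Ollama's local HTTP API. The sketch below builds a non-streaming request to the default endpoint using only the standard library. It assumes an Ollama server is running on the default port and that the model tag is the one used in the commands above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint
MODEL_TAG = "diffusiongemma:26b"                     # tag from the commands above


def build_request(prompt, model=MODEL_TAG, url=OLLAMA_URL):
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the request and return the generated text.

    Requires a running Ollama server; raises URLError if none is reachable.
    """
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]


# Building the request needs no network access; generate() does.
req = build_request("Summarize block diffusion in one sentence.")
print(req.full_url, req.get_method())
```

Calling `generate("...")` returns the full completion in one piece because `stream` is set to false, which suits a model that emits text a block at a time.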

For more control, use the Hugging Face transformers library with the official Google repository. The model supports NVFP4 (NVIDIA’s 4-bit floating-point) on Blackwell GPUs for further acceleration.

Performance Expectations

Tokens per second – Varies by hardware and quantization. On an RTX 4090 with Q4_K_M, expect 400–600 t/s. On an H100, 1,000+ t/s. On a mid-range GPU (16 GB), expect 100–200 t/s.
Generation latency – Time to first token is higher than for autoregressive models because the first canvas must be fully denoised before any text appears. After that, output arrives one canvas at a time (up to 256 tokens per completed block, each block taking several denoising steps) rather than token by token.
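The latency behaviour can be put into numbers with a back-of-envelope model. The sketch below assumes the quoted throughput is a sustained single-stream rate and that text becomes visible only when a whole canvas is finished. The prefill time is a made-up placeholder, not a measured value.

```python
BLOCK_SIZE = 256  # tokens per canvas (from the text)


def seconds_per_canvas(tokens_per_second, block_size=BLOCK_SIZE):
    """Wall-clock time to finish one canvas at a given sustained rate."""
    return block_size / tokens_per_second


def time_to_first_block(prefill_seconds, tokens_per_second):
    """Earliest moment any output is visible: prefill plus one full canvas."""
    return prefill_seconds + seconds_per_canvas(tokens_per_second)


def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Number of canvases required to cover num_tokens (ceiling division)."""
    return -(-num_tokens // block_size)


canvas_s = seconds_per_canvas(1008)        # H100 rate quoted on this page
first_s = time_to_first_block(0.5, 1008)   # 0.5 s prefill is hypothetical
reply_blocks = blocks_needed(1000)         # a 1,000-token reply

print(f"{canvas_s:.3f} s per canvas, {reply_blocks} canvases")
```

At about 1,000 tokens per second a canvas takes roughly a quarter of a second, so a short reply that fits in one block appears all at once after that delay instead of streaming in gradually.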

How It Compares

vs. Gemma 4 26B-A4B (Standard Autoregressive)

Metric	DiffusionGemma 26B	Gemma 4 26B
MMLU Pro	77.6	82.6
AIME 2026 (no tools)	69.1	~79
Generation speed (H100)	1,008 t/s	~250 t/s
Code infilling	Native	Not designed for this

Choose DiffusionGemma when speed and bidirectional generation are critical. Choose Gemma 4 if you need maximum accuracy and can tolerate slower inference.
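The trade-off in the table reduces to a few subtractions and one ratio. The sketch below uses only the figures listed above and takes the approximate Gemma 4 entries ("~79", "~250 t/s") at face value.

```python
# (DiffusionGemma, Gemma 4) scores from the comparison table.
SCORES = {
    "MMLU Pro": (77.6, 82.6),
    "AIME 2026 (no tools)": (69.1, 79.0),  # Gemma 4 value listed as "~79"
}
SPEED_H100 = (1008.0, 250.0)  # tokens/s; Gemma 4 value listed as "~250"

gaps = {
    name: round(gemma - diffusion, 1)
    for name, (diffusion, gemma) in SCORES.items()
}
speedup = SPEED_H100[0] / SPEED_H100[1]

for name, gap in gaps.items():
    print(f"{name}: {gap} points behind")
print(f"speedup: {speedup:.2f}x")
```

That is a gap of 5.0 points on MMLU Pro and about 9.9 on AIME 2026 in exchange for roughly a 4x gain in generation speed on an H100.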

vs. DeepSeek V2 (236B, MoE)

Both are MoE, but DeepSeek V2 has a 236B total parameter count and ~21B active. DeepSeek V2 scores higher on complex math and coding benchmarks but requires multiple GPUs for local deployment (80 GB+ VRAM). DiffusionGemma is more practical for single-GPU setups and is much faster on consumer hardware.

vs. Llama 3.1 70B (Dense)

Llama 3.1 70B needs roughly 40 GB even at 4-bit quantization (about 70 GB at 8-bit), which is beyond a single 24 GB or 32 GB consumer GPU. DiffusionGemma’s smaller total size (25.2B) gives it a clear VRAM advantage, and its 3.8B active parameters keep per-token compute low, so it offers competitive speed for its parameter class.

Bottom line: DiffusionGemma 26B-A4B is positioned as the fastest open-weight model in the roughly 25B-total-parameter class for local deployment. If your workload is latency-sensitive and you can tolerate benchmark gaps of 5–15 points vs. top dense models, it is a strong choice. For maximum accuracy on math, reasoning, or complex multilingual tasks, stick with standard autoregressive models at the same active parameter count.

