
Google's Mixture-of-Experts model with 26B total parameters but only ~4B active during inference. 256K context, configurable thinking modes, multimodal (text, image, video). Ranks #6 on the LMSYS Chatbot Arena with an estimated score of 1441. Runs almost as fast as a 4B model despite its 26B total parameter count.
Copy and paste this command to start running the model locally:

```shell
ollama run gemma4:26b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 10.2 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 11.0 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 11.4 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 11.9 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 12.9 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 16.7 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Vendor | Rating | Speed | VRAM Required |
|---|---|---|---|---|
| Google Cloud TPU v5e | Google | SS | 59.9 tok/s | 11.0 GB |
| Intel Arc A770 16GB | Intel | SS | 40.9 tok/s | 11.0 GB |
| NVIDIA L40S | NVIDIA | SS | 63.2 tok/s | 11.0 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 149.1 tok/s | 11.0 GB |
| Google Cloud TPU v5p | Google | SS | 202.1 tok/s | 11.0 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 244.9 tok/s | 11.0 GB |
Gemma 4 26B-A4B IT represents Google’s most efficient approach to high-reasoning local AI. By utilizing a Mixture-of-Experts (MoE) architecture, this model provides the logical depth of a 26B parameter model with the inference latency of a 4B parameter model. It is designed specifically for developers and researchers who need a multimodal, long-context engine that can handle complex reasoning, coding, and vision tasks without requiring a multi-GPU server rack.
Positioned as a high-performance alternative to dense models like Llama 3.1 8B or Mistral NeMo 12B, the Gemma 4 26B-A4B IT excels in instruction-following and structured data extraction. It currently ranks #6 on the LMSYS Chatbot Arena with an estimated score of 1441, placing it firmly in the "frontier" category for local weights. Its primary draw is the 256,000 token context window, which allows for massive document analysis and codebase ingestion that was previously reserved for closed-source APIs.
The "A4B" in the name stands for "Active 4 Billion" parameters. While the model has a total footprint of 26 billion parameters, only approximately 4 billion are activated during any single inference pass. This MoE architecture is the key to the model's efficiency: you get the knowledge capacity of a larger model, but tokens-per-second throughput stays high because the compute requirement is far lower than that of a 26B dense model.
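The routing idea behind this can be sketched in a few lines. The snippet below is an illustrative top-k gating loop with made-up dimensions and toy experts, not Gemma's actual router; it only shows why compute scales with the number of *selected* experts rather than the total:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sketch of top-k MoE routing: only k experts run per token.

    x       : (d,) token hidden state
    gate_w  : (d, E) router weights
    experts : list of E callables, each standing in for a small FFN
    """
    logits = x @ gate_w                    # one router score per expert
    top = np.argsort(logits)[-k:]          # indices of the k best-scoring experts
    w = np.exp(logits[top])
    w /= w.sum()                           # softmax over the selected experts only
    # Only the chosen experts compute; the rest stay idle (the "active 4B" effect).
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, E = 8, 4
experts = [lambda x, W=rng.standard_normal((d, d)) * 0.1: x @ W for _ in range(E)]
gate_w = rng.standard_normal((d, E))
y = moe_forward(rng.standard_normal(d), gate_w, experts)
print(y.shape)  # (8,)
```

With `k=2` of 4 experts selected, each token pays for roughly half the expert compute while all expert weights remain available to the router.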
The 256K context length is implemented via advanced rotary positional embeddings (RoPE), enabling the model to maintain coherence across extremely long prompts, which makes it a premier candidate for "needle-in-a-haystack" tasks among local models. Furthermore, the model is natively multimodal: vision capabilities are baked into the architecture rather than bolted on as an adapter. This allows for seamless transitions between analyzing a screenshot of code and generating the corresponding refactored text.
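As a rough illustration of how RoPE encodes position, here is a minimal numpy sketch. It rotates consecutive pairs of dimensions by position-dependent angles; the long-context variants used in production (scaled or interpolated RoPE) are omitted here:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary positional embedding to one vector (illustrative sketch).

    x   : (d,) with d even; consecutive pairs of dims are rotated together
    pos : integer token position
    """
    half = x.shape[0] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per dim pair
    angles = pos * freqs
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    # Rotate each (x1, x2) pair by its angle; position 0 leaves x unchanged.
    out[0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

v = np.arange(8, dtype=float)
r = rope(v, pos=42)
print(np.allclose(np.linalg.norm(r), np.linalg.norm(v)))  # rotation preserves norm -> True
```

Because the encoding is a pure rotation, attention scores end up depending on *relative* position, which is part of why RoPE extends gracefully to long contexts.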
Gemma 4 26B-A4B IT is a generalist with a heavy lean toward technical and logical tasks. It is not just a chat model; it is a functional tool for local pipelines.
To run Gemma 4 26B-A4B IT locally, your primary constraint is VRAM, not compute speed. Because it is an MoE model, it requires the full 26B parameters to be loaded into memory, even though it only "uses" 4B per token.
Memory requirements scale with your chosen quantization level. To preserve the model's reasoning quality, we recommend the levels shown in the quantization table above.
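A back-of-envelope way to reason about this: since all 26B parameters stay resident, weight memory is roughly parameters × bits-per-weight ÷ 8. The bits-per-weight figures below are approximate averages I'm assuming for mixed-precision GGUF formats, so the results are illustrative and will not exactly match the table above:

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Rough weight footprint in GB: every parameter must be loaded for an MoE model."""
    return total_params * bits_per_weight / 8 / 1e9

# Assumed average bits-per-weight for common GGUF formats (approximate).
for fmt, bpw in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"{fmt:7s} ~{weight_memory_gb(26e9, bpw):5.1f} GB")
```

On top of the weights, budget extra VRAM for the KV cache, which grows with context length, so leave headroom if you plan to use long prompts.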
The best GPU for Gemma 4 26B-A4B IT is an NVIDIA RTX 4090 (24GB). This allows you to run a high-quality 4-bit or 5-bit quantization with enough headroom for a moderate KV cache (context).
The quickest way to deploy is via Ollama: run `ollama run gemma4:26b` to pull the default quantized version. For power users, LM Studio or KoboldCPP offer more granular control over layer offloading and context management, which is vital when running a 26B model on consumer GPUs with limited VRAM.
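When offloading, a quick way to budget is to divide usable VRAM by the per-layer size. The layer count and file size in this sketch are hypothetical, purely to show the arithmetic behind settings like "GPU layers" in these tools:

```python
def gpu_layers(free_vram_gb: float, n_layers: int, model_gb: float,
               reserve_gb: float = 1.5) -> int:
    """How many transformer layers fit on the GPU, leaving headroom for KV cache.

    Assumes layers are roughly equal in size -- a simplification, since
    embeddings and MoE expert tensors are not perfectly uniform in practice.
    """
    per_layer = model_gb / n_layers
    usable = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable // per_layer))

# Hypothetical numbers: an 11 GB Q4 file, 48 layers, a 12 GB card.
print(gpu_layers(12.0, 48, 11.0))  # most layers fit; the rest run on CPU
```

Anything that does not fit on the GPU is streamed through system RAM at a significant speed cost, which is why the 24GB-class cards above avoid offloading entirely.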
When evaluating Gemma 4 26B-A4B IT vs competitors, the most common comparisons are against Mistral Small (22B) and Llama 3.1 8B.
In summary, Gemma 4 26B-A4B IT is the current gold standard for 20GB+ VRAM hardware. It offers a rare combination of high-speed inference and high-parameter reasoning, making it the most versatile model for local developers today.