
Google's Mixture-of-Experts model with 26B total parameters but only ~4B active during inference. 256K context, configurable thinking modes, multimodal (text, image, video). Ranks #6 on the LMSYS Chatbot Arena with an estimated score of 1441. Runs almost as fast as a 4B model despite its 26B total parameter count.
Copy and paste this command to start running the model locally:

```shell
ollama run gemma4:26b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 10.2 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 11.0 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 11.4 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 11.9 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 12.9 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 16.7 GB | Full | Full 16-bit floating point — maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Vendor | Rating | Speed | VRAM Required |
|---|---|---|---|---|
| Google Cloud TPU v5e | Google | SS | 59.9 tok/s | 11.0 GB |
| Intel Arc A770 16GB | Intel | SS | 40.9 tok/s | 11.0 GB |
| NVIDIA L40S | NVIDIA | SS | 63.2 tok/s | 11.0 GB |
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 149.1 tok/s | 11.0 GB |
| Google Cloud TPU v5p | Google | SS | 202.1 tok/s | 11.0 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 244.9 tok/s | 11.0 GB |
Gemma 4 26B-A4B IT represents Google’s most efficient approach to high-reasoning local AI. By utilizing a Mixture-of-Experts (MoE) architecture, this model provides the logical depth of a 26B parameter model with the inference latency of a 4B parameter model. It is designed specifically for developers and researchers who need a multimodal, long-context engine that can handle complex reasoning, coding, and vision tasks without requiring a multi-GPU server rack.
Positioned as a high-performance alternative to dense models like Llama 3.1 8B or Mistral NeMo 12B, the Gemma 4 26B-A4B IT excels in instruction-following and structured data extraction. It currently ranks #6 on the LMSYS Chatbot Arena with an estimated score of 1441, placing it firmly in the "frontier" category for local weights. Its primary draw is the 256,000 token context window, which allows for massive document analysis and codebase ingestion that was previously reserved for closed-source APIs.
The "A4B" in the name stands for "Active 4 Billion" parameters. While the model has a total footprint of 26 billion parameters, only approximately 4 billion are activated during any single inference pass. This MoE architecture is the key to the model's efficiency: you get the knowledge capacity of a larger model, but tokens-per-second throughput stays high because the compute requirement is far lower than that of a 26B dense model.
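The routing idea behind this can be sketched in a few lines. The snippet below is an illustrative top-k gating loop with made-up dimensions and toy experts, not Gemma's actual router; it only shows why compute scales with the number of *selected* experts rather than the total:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Sketch of top-k MoE routing: only k experts run per token.

    x       : (d,) token hidden state
    gate_w  : (d, E) router weights
    experts : list of E callables, each standing in for a small FFN
    """
    logits = x @ gate_w                    # one router score per expert
    top = np.argsort(logits)[-k:]          # indices of the k best-scoring experts
    w = np.exp(logits[top])
    w /= w.sum()                           # softmax over the selected experts only
    # Only the chosen experts compute; the rest stay idle (the "active 4B" effect).
    return sum(wi * experts[i](x) for wi, i in zip(w, top))

rng = np.random.default_rng(0)
d, E = 8, 4
experts = [lambda x, W=rng.standard_normal((d, d)) * 0.1: x @ W for _ in range(E)]
gate_w = rng.standard_normal((d, E))
y = moe_forward(rng.standard_normal(d), gate_w, experts)
print(y.shape)  # (8,)
```

With `k=2` of 4 experts selected, each token pays for roughly half the expert compute while all expert weights remain available to the router.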
The 256K context length is implemented via advanced rotary positional embeddings (RoPE), enabling the model to maintain coherence across extremely long prompts, which makes it a premier candidate for "needle-in-a-haystack" tasks among local models. Furthermore, the model is natively multimodal: vision capabilities are baked into the architecture rather than bolted on as an adapter. This allows for seamless transitions between analyzing a screenshot of code and generating the corresponding refactored text.
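As a rough illustration of how RoPE encodes position, here is a minimal numpy sketch. It rotates consecutive pairs of dimensions by position-dependent angles; the long-context variants used in production (scaled or interpolated RoPE) are omitted here:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary positional embedding to one vector (illustrative sketch).

    x   : (d,) with d even; consecutive pairs of dims are rotated together
    pos : integer token position
    """
    half = x.shape[0] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per dim pair
    angles = pos * freqs
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    # Rotate each (x1, x2) pair by its angle; position 0 leaves x unchanged.
    out[0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

v = np.arange(8, dtype=float)
r = rope(v, pos=42)
print(np.allclose(np.linalg.norm(r), np.linalg.norm(v)))  # rotation preserves norm -> True
```

Because the encoding is a pure rotation, attention scores end up depending on *relative* position, which is part of why RoPE extends gracefully to long contexts.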
Gemma 4 26B-A4B IT is a generalist with a heavy lean toward technical and logical tasks. It is not just a chat model; it is a functional tool for local pipelines.
To run Gemma 4 26B-A4B IT locally, your primary constraint is VRAM, not compute speed. Because it is an MoE model, it requires the full 26B parameters to be loaded into memory, even though it only "uses" 4B per token.
Memory requirements scale with your chosen quantization level. To preserve the model's reasoning quality, we recommend the levels shown in the quantization table above.
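A back-of-envelope way to reason about this: since all 26B parameters stay resident, weight memory is roughly parameters × bits-per-weight ÷ 8. The bits-per-weight figures below are approximate averages I'm assuming for mixed-precision GGUF formats, so the results are illustrative and will not exactly match the table above:

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Rough weight footprint in GB: every parameter must be loaded for an MoE model."""
    return total_params * bits_per_weight / 8 / 1e9

# Assumed average bits-per-weight for common GGUF formats (approximate).
for fmt, bpw in [("Q2_K", 2.6), ("Q4_K_M", 4.8), ("Q8_0", 8.5), ("FP16", 16.0)]:
    print(f"{fmt:7s} ~{weight_memory_gb(26e9, bpw):5.1f} GB")
```

On top of the weights, budget extra VRAM for the KV cache, which grows with context length, so leave headroom if you plan to use long prompts.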
The best GPU for Gemma 4 26B-A4B IT is an NVIDIA RTX 4090 (24GB). This allows you to run a high-quality 4-bit or 5-bit quantization with enough headroom for a moderate KV cache (context).
The quickest way to deploy is via Ollama: run `ollama run gemma4:26b` to pull the default quantized version. For power users, LM Studio or KoboldCPP offer more granular control over layer offloading and context management, which is vital when running a 26B model on consumer GPUs with limited VRAM.
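When offloading, a quick way to budget is to divide usable VRAM by the per-layer size. The layer count and file size in this sketch are hypothetical, purely to show the arithmetic behind settings like "GPU layers" in these tools:

```python
def gpu_layers(free_vram_gb: float, n_layers: int, model_gb: float,
               reserve_gb: float = 1.5) -> int:
    """How many transformer layers fit on the GPU, leaving headroom for KV cache.

    Assumes layers are roughly equal in size -- a simplification, since
    embeddings and MoE expert tensors are not perfectly uniform in practice.
    """
    per_layer = model_gb / n_layers
    usable = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable // per_layer))

# Hypothetical numbers: an 11 GB Q4 file, 48 layers, a 12 GB card.
print(gpu_layers(12.0, 48, 11.0))  # most layers fit; the rest run on CPU
```

Anything that does not fit on the GPU is streamed through system RAM at a significant speed cost, which is why the 24GB-class cards above avoid offloading entirely.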
When evaluating Gemma 4 26B-A4B IT vs competitors, the most common comparisons are against Mistral Small (22B) and Llama 3.1 8B.
In summary, Gemma 4 26B-A4B IT is the current gold standard for 20GB+ VRAM hardware. It offers a rare combination of high-speed inference and high-parameter reasoning, making it the most versatile model for local developers today.