
Google's efficient edge model with an 'Effective 4B' parameter footprint using Per-Layer Embeddings (PLE). Supports text, image, video, and native audio input. 128K context. Designed for on-device deployment on phones, laptops, and IoT devices with near-zero latency.
Copy and paste this command to start running the model locally:

`ollama run gemma4:e4b`

Access model weights, configuration files, and documentation.
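Beyond the CLI, a running Ollama instance also exposes a local HTTP API (the documented `/api/generate` endpoint). The sketch below is a minimal standard-library example, assuming the server is listening on Ollama's default port 11434 and the tag above has already been pulled:

```python
# Minimal sketch: query a local Ollama server from Python (stdlib only).
# Assumes Ollama is running on the default port and "gemma4:e4b" is pulled.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(prompt: str, model: str = "gemma4:e4b") -> bytes:
    """Build the JSON request body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str) -> str:
    """Send a prompt and return the model's text response (needs a live server)."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize this file: ...")` returns the completion as a string once the server is up; only the payload construction runs without one.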
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 6.1 GB | Low | Aggressive quantization — smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 6.9 GB | Good | Best balance of size and quality for most use-cases |
| Q5_K_M | 7.3 GB | Very Good | Slightly better quality than Q4 with moderate size increase |
| Q6_K | 7.8 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 8.8 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 12.6 GB | Full | Full 16-bit floating point — maximum quality, largest size |
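The table's figures can be sanity-checked with a back-of-envelope calculation. The sketch below estimates the size of the quantized weight file alone, assuming a raw parameter count of roughly 6.8B (inferred here from the FP16 row at 2 bytes per weight, not an official figure); the table's VRAM numbers run a few GB higher because they also include KV cache and runtime overhead. The bits-per-weight values are approximate effective rates for common GGUF formats.

```python
# Back-of-envelope GGUF weight sizes for Gemma 4 E4B.
# Assumption: ~6.8B raw parameters, inferred from the FP16 row above
# (12.6 GB / 2 bytes per weight). Table figures add KV cache/overhead.

GIB = 1024 ** 3
RAW_PARAMS = 6.8e9  # assumption, see note above

def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the quantized weight file alone, in GiB."""
    return n_params * bits_per_weight / 8 / GIB

# Approximate effective bits per weight for common GGUF quant formats.
QUANTS = {
    "Q2_K": 2.6,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}

for name, bits in QUANTS.items():
    print(f"{name:7s} ~{weight_size_gb(RAW_PARAMS, bits):5.1f} GiB weights")
```

The FP16 estimate lands near the table's 12.6 GB, while the quantized rows sit a few GB below their table entries, which is the expected gap once context and activation memory are added back in.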
See which devices can run this model and at what quality level.

| Device | Tier | Throughput | VRAM Required |
|---|---|---|---|
| Intel Arc B580 | SS | 53.1 tok/s | 6.9 GB |
| NVIDIA GeForce RTX 4070 | SS | 58.7 tok/s | 6.9 GB |
| NVIDIA GeForce RTX 5070 | SS | 78.2 tok/s | 6.9 GB |
| Google Cloud TPU v5e | SS | 95.3 tok/s | 6.9 GB |
| Intel Arc A770 16GB | SS | 65.2 tok/s | 6.9 GB |
| NVIDIA L40S | AA | 100.6 tok/s | 6.9 GB |
Gemma 4 E4B IT is Google’s high-efficiency "Effective 4B" multimodal model, designed specifically for edge deployment. Unlike previous generations that relied on standard dense scaling, this model utilizes Per-Layer Embeddings (PLE) to compress its footprint while maintaining the reasoning capabilities of much larger architectures. It is a native multimodal model, meaning it processes text, images, video, and audio through a unified architecture rather than relying on separate adapter modules.
For developers and engineers, Gemma 4 E4B IT represents a shift toward "near-zero latency" local AI. It competes directly with models like Phi-3.5 Mini and Llama 3.2 3B, offering a significantly larger 128,000-token context window and native audio processing that is often missing in the sub-8B parameter category. This makes it a primary candidate for local agents, real-time voice assistants, and on-device document analysis where data privacy and latency are the primary constraints.
The "Effective 4B" designation refers to a dense architecture that optimizes parameter distribution through PLE. By using specialized embedding layers, the model reduces the memory overhead traditionally associated with high-vocabulary, multilingual models. This architectural choice directly improves performance, allowing the model to fit into the memory constraints of mobile devices and entry-level laptops without the large perplexity hits usually seen with aggressive pruning.
The standout technical feature is the 128,000-token context length. For a 4B parameter model, this is an outlier: most models in this weight class struggle with "needle-in-a-haystack" retrieval beyond 8K or 32K tokens. Gemma 4 E4B IT is engineered to maintain coherence over long-form inputs, making it viable for analyzing entire codebases or long PDF documents locally. As a dense model, every parameter is active during inference, providing consistent throughput and predictable VRAM requirements.
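Before feeding a long document into the model, it is worth checking whether it actually fits in the 128K window. The sketch below uses the common ~4 characters-per-token heuristic, which is an assumption; exact counts depend on the model's tokenizer.

```python
# Rough pre-flight check: will a document fit in the 128K context window?
# Uses the common ~4 characters-per-token heuristic (an approximation;
# real counts depend on the tokenizer and the language of the text).
CONTEXT_TOKENS = 128_000

def approx_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, reserve_for_output: int = 2_000) -> bool:
    """True if the prompt leaves room for `reserve_for_output` generated tokens."""
    return approx_tokens(text) <= CONTEXT_TOKENS - reserve_for_output

doc = "word " * 100_000  # ~500K characters, roughly 125K estimated tokens
print(fits_in_context(doc))  # True: just under the 126K-token budget
```

Reserving a slice of the window for the model's output is the usual practice, since prompt plus completion must share the same context budget.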
Gemma 4 E4B IT is an instruction-tuned (IT) model, meaning it is optimized for conversational dialogue and following complex system prompts. Its multimodality is native, which is critical for practitioners building integrated applications.
To run Gemma 4 E4B IT locally, your primary constraint will be VRAM. Because this is a 4B parameter model, it is exceptionally accessible for consumer-grade hardware.
The Gemma 4 E4B IT hardware requirements vary based on your choice of quantization. At 4-bit quantization, the model is small enough to run on almost any modern smartphone or laptop.
When deciding on the best quantization for Gemma 4 E4B IT, the standard recommendation for most practitioners is Q4_K_M. This provides a 4.5-bit effective precision that retains nearly all the logic of the FP16 model while cutting the memory footprint by more than half.
Running `ollama run gemma4:e4b` will automatically pull the optimized manifest and handle memory allocation for your specific GPU. For developers integrating the model into a Python stack, llama-cpp-python or vLLM (on Linux/NVIDIA) is recommended for production-grade local serving.

Gemma 4 E4B IT enters a crowded field of "small" models. Here is how it stacks up against its closest rivals in the 2025 class of ~4B-parameter local models:
Microsoft’s Phi-3.5 Mini is a formidable competitor in logic and coding. However, Gemma 4 E4B IT generally offers superior multimodal capabilities, particularly in native audio processing. While Phi-3.5 is excellent for pure text reasoning, Gemma is the better choice for "eyes and ears" applications. Gemma’s 128K context also dwarfs the standard windows of many Phi implementations.
Llama 3.2 3B is highly optimized for mobile, but Gemma’s PLE architecture gives it a slight edge in multilingual tasks and long-context retention. Llama 3.2 is often easier to find fine-tunes for, but for out-of-the-box multimodal performance, Gemma 4 E4B IT is more versatile.
For practitioners building a ~4B-parameter local setup in 2025, Gemma 4 E4B IT is currently the benchmark for multimodal efficiency. It bridges the gap between "toy" models and production-ready local intelligence.