
Google's smallest Gemma 4 edge model with an 'Effective 2B' parameter footprint using Per-Layer Embeddings. Full multimodal support including native audio input. 128K context. Optimized for mobile and IoT with minimal RAM and battery usage.
Copy and paste this command to start running the model locally.

```shell
ollama run gemma4:e2b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 3.3 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 3.7 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 3.9 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | 4.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 4.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 6.5 GB | Full | Full 16-bit floating point: maximum quality, largest size |
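As a rough sanity check on the table above, raw weight size scales with bits per weight. The sketch below uses approximate bits-per-weight figures commonly quoted for GGUF quant formats (assumed community values, not official specs). Note that the table's VRAM numbers sit well above raw weight size because they also include KV cache and runtime overhead.

```python
# Rough GGUF weight-size estimate: params * bits-per-weight / 8 bytes.
# Bits-per-weight values are approximate community figures, not official specs.
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.85, "Q5_K_M": 5.7,
       "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0}

def weight_size_gb(params: float, fmt: str) -> float:
    """Approximate in-VRAM size of the raw weights in GB (excludes KV cache)."""
    return params * BPW[fmt] / 8 / 1e9

for fmt in BPW:
    print(f"{fmt:7s} ~{weight_size_gb(2e9, fmt):.2f} GB of weights")
```

For an effective-2B model, FP16 weights alone come out to about 4 GB, which is consistent with the prose below; the gap up to the table's 6.5 GB figure is runtime overhead.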
See which devices can run this model and at what quality level.
| Device | Grade | Throughput | VRAM |
|---|---|---|---|
| NVIDIA GeForce RTX 4060 | AA | 59.1 tok/s | 3.7 GB |
| NVIDIA GeForce RTX 4070 | AA | 109.4 tok/s | 3.7 GB |
| NVIDIA GeForce RTX 5070 | AA | 145.9 tok/s | 3.7 GB |
| Intel Arc B580 | AA | 99.0 tok/s | 3.7 GB |
| Intel Arc A770 16GB | AA | 121.6 tok/s | 3.7 GB |
| Google Cloud TPU v5e | AA | 177.8 tok/s | 3.7 GB |
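To translate the throughput figures above into perceived latency, divide the expected reply length by the sustained decode rate. A minimal helper (the token counts and rates below are illustrative, not benchmarks from this page):

```python
def generation_time_s(reply_tokens: int, tok_per_s: float) -> float:
    """Seconds to stream a reply of `reply_tokens` at a sustained decode rate."""
    return reply_tokens / tok_per_s

# A 300-token reply at ~60 tok/s (entry-level GPU) vs ~180 tok/s (high end):
print(f"{generation_time_s(300, 60):.1f} s vs {generation_time_s(300, 180):.1f} s")
```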
The Gemma 4 E2B IT is Google’s most efficient edge-optimized model to date, designed specifically for deployment on mobile devices, IoT hardware, and low-power consumer electronics. As an "Effective 2B" parameter model, it utilizes Per-Layer Embeddings to achieve the performance typically expected of larger architectures while maintaining a footprint small enough for real-time interaction on commodity hardware.
Unlike previous iterations of small language models (SLMs) that were text-only, Gemma 4 E2B IT is natively multimodal. It supports vision and audio input alongside its 128,000-token context window, making it a primary candidate for local AI agents that need to process long documents, analyze visual data, or interact via voice without a round-trip to a cloud server. It competes directly with other lightweight models like Llama 3.2 1B/3B and Phi-3.5 Mini, positioning itself as a high-performance alternative for developers who prioritize low latency and minimal battery drain.
The "E2B" designation refers to "Effective 2B" parameters. This architecture uses a dense transformer backbone but optimizes the parameter footprint through Per-Layer Embeddings. This allows the model to retain a high degree of reasoning capability and multilingual fluency while keeping the weights compact.
The 128K context window is a significant technical milestone for a 2B parameter model. Traditionally, long context windows in small models suffer from severe "lost in the middle" degradation or massive KV (Key-Value) cache overhead. Gemma 4 E2B IT, however, is optimized for high retrieval accuracy across its entire context. For practitioners, this means you can load several technical manuals or a large codebase into the local prompt without the model losing track of the initial instructions.
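The KV cache overhead mentioned above grows linearly with context length. The sketch below estimates it for a hypothetical set of architecture dimensions (the layer, head, and dimension counts are placeholders, not published Gemma 4 specs), which makes clear why quantizing the weights to free up VRAM for the cache matters at 128K:

```python
def kv_cache_gb(ctx_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB: one K and one V tensor per layer,
    stored at fp16 (2 bytes per element) by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_tokens / 1e9

# Illustrative dims only -- NOT published Gemma 4 architecture figures:
print(f"~{kv_cache_gb(131_072, layers=24, kv_heads=4, head_dim=128):.1f} GB "
      "of KV cache at a full 128K context")
```

Halving the cache precision (`bytes_per_elem=1` for an 8-bit cache) halves this figure, which is one reason small-footprint deployments often quantize the cache as well as the weights.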
This is a native multimodal model. Rather than using a separate vision encoder or audio-to-text bridge, Gemma 4 E2B IT processes these modalities within its unified architecture. This reduces the VRAM overhead usually required to run multiple specialized models and ensures that the "vision" and "audio" capabilities benefit from the same instruction-tuning as the text-based logic.
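Because images go through the same chat interface as text, a request to a local Ollama server can attach base64-encoded image data directly to a message. The sketch below only builds the JSON body for Ollama's `/api/chat` endpoint; actually sending it requires a running Ollama server, and the `gemma4:e2b` tag is the one used on this page:

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """JSON body for Ollama's /api/chat with one image attached to the user turn."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            # Ollama expects base64-encoded image data in the `images` list
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }
    return json.dumps(payload)

body = build_vision_request("gemma4:e2b", "Describe this image.", b"\x89PNG...")
```

POSTing `body` to `http://localhost:11434/api/chat` on a machine running Ollama would return the model's description of the image.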
Gemma 4 E2B IT is an instruction-tuned (IT) model, meaning it is optimized for chat interfaces and direct command execution. Its multimodal nature and 128K context window make it suitable for a range of local workloads, such as long-document analysis, on-device vision tasks, and voice interaction.
The primary draw of this model is its accessibility. You do not need a workstation-class GPU to run Gemma 4 E2B IT; it is designed to be performant on integrated graphics and mobile chips.
The Gemma 4 E2B IT VRAM requirements are exceptionally low. At FP16 precision, the model weights alone take up approximately 4 GB of VRAM. However, most practitioners will run it at 4-bit or 8-bit quantization to leave room for the KV cache when utilizing the full 128K context.
For most users, the best quantization for Gemma 4 E2B IT is Q4_K_M or Q6_K.
When you run Gemma 4 E2B IT locally, you can expect high throughput even on modest hardware.
To get started immediately, Ollama is the recommended tool. Running `ollama run gemma4:e2b` will pull the model and configure the environment for your specific hardware automatically.
When evaluating a local AI model with 2B parameters in 2025, the comparison usually falls between Gemma 4, Llama 3.2, and Phi-3.5.
Llama 3.2 3B is a formidable competitor with strong community support. However, Gemma 4 E2B IT generally edges it out in multimodal tasks, particularly native audio processing. While Llama 3.2 is excellent for text-based instruction following, both models advertise a 128K context window; at the edge, Gemma's Per-Layer Embeddings give it a memory-efficiency advantage when that context is actually filled.
Microsoft’s Phi-3.5 Mini (3.8B) is known for "punching above its weight" in reasoning. While Phi-3.5 might outperform Gemma in complex logic or mathematical coding tasks, Gemma 4 E2B IT is the superior choice for mobile and IoT integration: its "Effective 2B" footprint is roughly half the size of Phi-3.5 Mini, making it the better fit for devices with strictly limited RAM (4 GB or less).