
Google's lightweight multimodal model that beats Gemma 2 27B. 128K context, 140+ languages, suitable for edge deployment.
Copy and paste this command to start running the model locally.
```shell
ollama run gemma3:4b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 6.1 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 6.9 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 7.3 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | 7.8 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 8.8 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 12.6 GB | Full | Full 16-bit floating point: maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| Intel Arc B580 | Intel | S | 53.1 tok/s | 6.9 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | S | 58.7 tok/s | 6.9 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | S | 78.2 tok/s | 6.9 GB |
| Google Cloud TPU v5e | Google | S | 95.3 tok/s | 6.9 GB |
| Intel Arc A770 16GB | Intel | S | 65.2 tok/s | 6.9 GB |
| NVIDIA L40S | NVIDIA | A | 100.6 tok/s | 6.9 GB |
Gemma 3 4B IT is Google’s high-efficiency, multimodal model designed specifically for edge deployment and local inference. Despite its small 4-billion parameter footprint, this model outperforms the previous generation Gemma 2 27B across several benchmarks, representing a significant leap in parameter efficiency. It is a dense, instruction-tuned (IT) model that handles both text and vision inputs, making it a versatile choice for developers who need to run Gemma 3 4B IT locally on consumer-grade hardware.
With a training cutoff of August 2024, Gemma 3 4B IT is more current than many of its contemporaries. It bridges the gap between ultra-lightweight models (like the 1B class) and mid-sized models (8B-14B), offering a massive 128,000-token context window that was previously reserved for much larger architectures. This makes it a primary candidate for local RAG (Retrieval-Augmented Generation) applications and long-form document analysis on devices with limited VRAM.
Gemma 3 4B IT utilizes a dense transformer architecture. Unlike Mixture-of-Experts (MoE) models that only activate a fraction of their parameters during inference, this 4B model is fully dense. This results in predictable memory usage and consistent performance across different types of prompts.
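To make the dense-versus-MoE distinction concrete, here is a toy calculation. The parameter counts are approximate and the MoE configuration is hypothetical, chosen only to illustrate the contrast:

```python
def active_params(total, experts=1, active_experts=1, shared_fraction=0.0):
    """Parameters touched per token. For a dense model (one "expert",
    always active), this is simply the full parameter count."""
    routed = total * (1 - shared_fraction)
    return total * shared_fraction + routed * active_experts / experts

# Dense ~4B model: every parameter participates on every token.
dense = active_params(4.3e9)
# Hypothetical 40B MoE with 16 experts, 2 active per token.
moe = active_params(40e9, experts=16, active_experts=2)
print(f"dense active: {dense / 1e9:.1f}B, MoE active: {moe / 1e9:.1f}B")
```

Because the active set never changes, a dense model's per-token latency and memory traffic are constant, which is exactly the predictability the paragraph above describes.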
The most notable technical achievement in this model is the integration of native multimodal capabilities within a 4B parameter budget. It is trained to process both text and visual data natively, rather than relying on a separate vision encoder tacked onto a language model. This unified approach reduces latency and improves the model's ability to reason about the relationship between visual elements and textual instructions.
The 128K context length is a standout feature for a 4B-parameter local model in 2025. Managing a context window of this size requires careful attention to KV (key-value) cache management. While the model weights themselves require relatively little memory, filling the 128K context window will significantly increase the Gemma 3 4B IT VRAM requirements. For practitioners, this means you can feed the model entire technical manuals or long codebases, provided you have the VRAM to support the resulting KV cache.
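As a back-of-the-envelope sketch of why the cache dominates at long context (the layer count, KV-head count, and head dimension below are illustrative placeholders, not official Gemma 3 4B config values):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Naive KV cache size: keys + values for every layer and every token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative config for a ~4B dense transformer with grouped-query attention.
cache_gb = kv_cache_bytes(
    n_layers=34, n_kv_heads=8, head_dim=256, context_len=128_000
) / 1e9
print(f"Naive KV cache at full 128K context: ~{cache_gb:.1f} GB (FP16)")
```

Real runtimes shrink this considerably, via quantized KV caches and, in Gemma 3's case, interleaved sliding-window attention layers that cap most layers' cache at a short window, but the linear growth with context length is the number to budget for.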
Gemma 3 4B IT is not a general-purpose toy; it is a specialized tool for instruction-following, multilingual communication, and visual reasoning.
Because Gemma 3 4B IT is a text-and-vision model, it excels at local image processing tasks.
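As a minimal sketch of sending an image to the model locally, the snippet below builds a request for Ollama's `/api/chat` endpoint with a base64-encoded image attached. The image path and prompt are placeholders, and it assumes an Ollama server running on the default port 11434:

```python
import base64
import json
from urllib import request

def build_vision_request(prompt, image_path, model="gemma3:4b"):
    """Build an Ollama /api/chat payload with one attached image."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [img_b64]}],
        "stream": False,
    }

if __name__ == "__main__":
    # Placeholder image path; requires a local Ollama server to actually run.
    payload = build_vision_request("Describe this diagram.", "diagram.png")
    req = request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        print(json.load(resp)["message"]["content"])
```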
The model supports over 140 languages, making it one of the most linguistically capable models in the sub-10B category. This is particularly useful for localized applications where data privacy is paramount and translation must happen on-device without hitting external APIs.
While not a dedicated coding model like DeepSeek-Coder or CodeLlama, Gemma 3 4B IT is highly proficient in Python, JavaScript, C++, and Rust, and its instruction-following tuning makes it an effective local co-pilot.
The primary appeal of a 4B model is its accessibility. You do not need a data-center GPU to achieve high token throughput with Gemma 3 4B IT.
To run this model effectively, your primary bottleneck will be VRAM and memory bandwidth.
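A useful rule of thumb for why bandwidth matters: single-stream decoding must stream the weights once per generated token, so peak memory bandwidth divided by the quantized weight size gives a hard upper bound on throughput. The bandwidth figures below are published vendor specs; the weight size is an approximation for a 4-bit 4B model:

```python
def roofline_tok_s(bandwidth_gb_s, weights_gb):
    """Bandwidth-bound upper limit on decode speed, in tokens per second."""
    return bandwidth_gb_s / weights_gb

q4_weights_gb = 2.6  # approximate Q4_K_M weight size for a ~4B model
for device, bw in [("RTX 4070", 504), ("RTX 5070", 672), ("M4 Max", 546)]:
    print(f"{device}: <= {roofline_tok_s(bw, q4_weights_gb):.0f} tok/s")
```

Observed speeds land well below this roofline because of KV cache reads, dequantization, and kernel overhead, but the ratio explains why a small quantized model feels fast on mid-range hardware.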
For most practitioners, Q4_K_M (4-bit) is the "sweet spot." It reduces the model size to roughly 2.5GB–3GB while maintaining over 95% of the performance of the full FP16 version. If you are using the model for precision tasks like coding or complex reasoning, upgrading to Q6_K or Q8_0 is recommended, as it minimizes the perplexity loss associated with lower bitrates.
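The size claims above can be sanity-checked with simple arithmetic: parameter count times effective bits per weight. The bits-per-weight values are approximate averages for llama.cpp K-quants, and the parameter count is approximate:

```python
PARAMS = 4.3e9  # approximate parameter count of Gemma 3 4B

def weight_size_gb(params, bits_per_weight):
    """Rough on-disk / in-VRAM size of the weights alone."""
    return params * bits_per_weight / 8 / 1e9

for name, bpw in [("Q4_K_M", 4.85), ("Q6_K", 6.56), ("Q8_0", 8.5), ("FP16", 16)]:
    print(f"{name}: ~{weight_size_gb(PARAMS, bpw):.1f} GB")
```

Note these are weight sizes only; the VRAM figures in the quantization table above are higher because they also cover runtime overhead and context.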
On an RTX 30-series or 40-series GPU, you can expect smooth, real-time generation; a mid-range RTX 4070, for example, sustains roughly 58 tok/s at Q4_K_M.
On Apple Silicon (M2/M3/M4 Max), the model is exceptionally fast due to the unified memory architecture. An M4 Max can run this model at speeds that feel instantaneous, making it perfect for real-time chat applications.
The quickest way to get started is via Ollama. Once Ollama is installed, you can pull the model with a single command:
```shell
ollama run gemma3:4b
```
For developers building applications, using a local inference server like LM Studio or llama.cpp allows you to expose an OpenAI-compatible API endpoint, making it easy to swap Gemma 3 4B IT into existing workflows.
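A minimal sketch of that swap, assuming Ollama's OpenAI-compatible endpoint on its default port (LM Studio defaults to port 1234 instead); the model tag is the Ollama one used above:

```python
def chat_completion_payload(prompt, model="gemma3:4b"):
    """Request body accepted by any OpenAI-compatible /v1/chat/completions."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

if __name__ == "__main__":
    # Requires `pip install openai` and a running local server.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    resp = client.chat.completions.create(
        **chat_completion_payload("Summarize retrieval-augmented generation in one sentence.")
    )
    print(resp.choices[0].message.content)
```

Because the request shape is the standard OpenAI one, existing tooling only needs the `base_url` (and a dummy API key) changed to target the local model.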
When evaluating Gemma 3 4B IT, it is most often compared against Llama 3.2 3B and Phi-4 (14B or mini variants).
Gemma 3 4B IT represents the current state of the art among 4B-parameter local models in 2025. It is the first model of this size that truly feels "unconstrained," offering the context window and multimodal capabilities previously limited to massive cloud-based models. For developers building edge-AI applications, it is currently the model to beat.