
Google's lightweight multimodal model that beats Gemma 2 27B. 128K context, 140+ languages, suitable for edge deployment.
Copy and paste this command to start running the model locally.
```shell
ollama run gemma3:4b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 6.1 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (recommended) | 6.9 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 7.3 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | 7.8 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 8.8 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 12.6 GB | Full | Full 16-bit floating point: maximum quality, largest size |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Used |
|---|---|---|---|---|
| Intel Arc B580 | Intel | S | 53.1 tok/s | 6.9 GB |
| NVIDIA GeForce RTX 4070 | NVIDIA | S | 58.7 tok/s | 6.9 GB |
| NVIDIA GeForce RTX 5070 | NVIDIA | S | 78.2 tok/s | 6.9 GB |
| Google Cloud TPU v5e | Google | S | 95.3 tok/s | 6.9 GB |
| Intel Arc A770 16GB | Intel | S | 65.2 tok/s | 6.9 GB |
| NVIDIA L40S | NVIDIA | A | 100.6 tok/s | 6.9 GB |
Gemma 3 4B IT is Google’s high-efficiency, multimodal model designed specifically for edge deployment and local inference. Despite its small 4-billion parameter footprint, this model outperforms the previous generation Gemma 2 27B across several benchmarks, representing a significant leap in parameter efficiency. It is a dense, instruction-tuned (IT) model that handles both text and vision inputs, making it a versatile choice for developers who need to run Gemma 3 4B IT locally on consumer-grade hardware.
With a training cutoff of August 2024, Gemma 3 4B IT is more current than many of its contemporaries. It bridges the gap between ultra-lightweight models (like the 1B class) and mid-sized models (8B-14B), offering a massive 128,000-token context window that was previously reserved for much larger architectures. This makes it a primary candidate for local RAG (Retrieval-Augmented Generation) applications and long-form document analysis on devices with limited VRAM.
Gemma 3 4B IT utilizes a dense transformer architecture. Unlike Mixture-of-Experts (MoE) models that only activate a fraction of their parameters during inference, this 4B model is fully dense. This results in predictable memory usage and consistent performance across different types of prompts.
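To make the dense-versus-MoE distinction concrete, here is a toy calculation. The parameter counts are approximate and the MoE configuration is hypothetical, chosen only to illustrate the contrast:

```python
def active_params(total, experts=1, active_experts=1, shared_fraction=0.0):
    """Parameters touched per token. For a dense model (one "expert",
    always active), this is simply the full parameter count."""
    routed = total * (1 - shared_fraction)
    return total * shared_fraction + routed * active_experts / experts

# Dense ~4B model: every parameter participates on every token.
dense = active_params(4.3e9)
# Hypothetical 40B MoE with 16 experts, 2 active per token.
moe = active_params(40e9, experts=16, active_experts=2)
print(f"dense active: {dense / 1e9:.1f}B, MoE active: {moe / 1e9:.1f}B")
```

Because the active set never changes, a dense model's per-token latency and memory traffic are constant, which is exactly the predictability the paragraph above describes.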
The most notable technical achievement in this model is the integration of native multimodal capabilities within a 4B parameter budget. It is trained to process both text and visual data natively, rather than relying on a separate vision encoder tacked onto a language model. This unified approach reduces latency and improves the model's ability to reason about the relationship between visual elements and textual instructions.
The 128K context length is a standout feature for a 4B-parameter local model in 2025. Managing a context window of this size requires careful attention to KV (key-value) cache management. While the model weights themselves require relatively little memory, filling the 128K context window will significantly increase the Gemma 3 4B IT VRAM requirements. For practitioners, this means you can feed the model entire technical manuals or long codebases, provided you have the VRAM to support the resulting KV cache.
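As a back-of-the-envelope sketch of why the cache dominates at long context (the layer count, KV-head count, and head dimension below are illustrative placeholders, not official Gemma 3 4B config values):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Naive KV cache size: keys + values for every layer and every token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Illustrative config for a ~4B dense transformer with grouped-query attention.
cache_gb = kv_cache_bytes(
    n_layers=34, n_kv_heads=8, head_dim=256, context_len=128_000
) / 1e9
print(f"Naive KV cache at full 128K context: ~{cache_gb:.1f} GB (FP16)")
```

Real runtimes shrink this considerably, via quantized KV caches and, in Gemma 3's case, interleaved sliding-window attention layers that cap most layers' cache at a short window, but the linear growth with context length is the number to budget for.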
Gemma 3 4B IT is not a general-purpose toy; it is a specialized tool for instruction-following, multilingual communication, and visual reasoning.
Because Gemma 3 4B IT is a text-and-vision model, it excels at local image processing tasks.
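As a minimal sketch of sending an image to the model locally, the snippet below builds a request for Ollama's `/api/chat` endpoint with a base64-encoded image attached. The image path and prompt are placeholders, and it assumes an Ollama server running on the default port 11434:

```python
import base64
import json
from urllib import request

def build_vision_request(prompt, image_path, model="gemma3:4b"):
    """Build an Ollama /api/chat payload with one attached image."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [img_b64]}],
        "stream": False,
    }

if __name__ == "__main__":
    # Placeholder image path; requires a local Ollama server to actually run.
    payload = build_vision_request("Describe this diagram.", "diagram.png")
    req = request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        print(json.load(resp)["message"]["content"])
```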
The model supports over 140 languages, making it one of the most linguistically capable models in the sub-10B category. This is particularly useful for localized applications where data privacy is paramount and translation must happen on-device without hitting external APIs.
While not a dedicated coding model like DeepSeek-Coder or CodeLlama, Gemma 3 4B IT is highly proficient in Python, JavaScript, C++, and Rust, and its instruction-following tuning makes it an effective local co-pilot.
The primary appeal of a 4B model is its accessibility. You do not need a data-center GPU to achieve high token throughput with Gemma 3 4B IT.
To run this model effectively, your primary bottleneck will be VRAM and memory bandwidth.
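A useful rule of thumb for why bandwidth matters: single-stream decoding must stream the weights once per generated token, so peak memory bandwidth divided by the quantized weight size gives a hard upper bound on throughput. The bandwidth figures below are published vendor specs; the weight size is an approximation for a 4-bit 4B model:

```python
def roofline_tok_s(bandwidth_gb_s, weights_gb):
    """Bandwidth-bound upper limit on decode speed, in tokens per second."""
    return bandwidth_gb_s / weights_gb

q4_weights_gb = 2.6  # approximate Q4_K_M weight size for a ~4B model
for device, bw in [("RTX 4070", 504), ("RTX 5070", 672), ("M4 Max", 546)]:
    print(f"{device}: <= {roofline_tok_s(bw, q4_weights_gb):.0f} tok/s")
```

Observed speeds land well below this roofline because of KV cache reads, dequantization, and kernel overhead, but the ratio explains why a small quantized model feels fast on mid-range hardware.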
For most practitioners, Q4_K_M (4-bit) is the "sweet spot." It reduces the model size to roughly 2.5GB–3GB while maintaining over 95% of the performance of the full FP16 version. If you are using the model for precision tasks like coding or complex reasoning, upgrading to Q6_K or Q8_0 is recommended, as it minimizes the perplexity loss associated with lower bitrates.
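The size claims above can be sanity-checked with simple arithmetic: parameter count times effective bits per weight. The bits-per-weight values are approximate averages for llama.cpp K-quants, and the parameter count is approximate:

```python
PARAMS = 4.3e9  # approximate parameter count of Gemma 3 4B

def weight_size_gb(params, bits_per_weight):
    """Rough on-disk / in-VRAM size of the weights alone."""
    return params * bits_per_weight / 8 / 1e9

for name, bpw in [("Q4_K_M", 4.85), ("Q6_K", 6.56), ("Q8_0", 8.5), ("FP16", 16)]:
    print(f"{name}: ~{weight_size_gb(PARAMS, bpw):.1f} GB")
```

Note these are weight sizes only; the VRAM figures in the quantization table above are higher because they also cover runtime overhead and context.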
On an RTX 30-series or 40-series GPU, you can expect smooth, real-time generation; a mid-range RTX 4070, for example, sustains roughly 58 tok/s at Q4_K_M.
On Apple Silicon (M2/M3/M4 Max), the model is exceptionally fast due to the unified memory architecture. An M4 Max can run this model at speeds that feel instantaneous, making it perfect for real-time chat applications.
The quickest way to get started is via Ollama. Once Ollama is installed, you can pull the model with a single command:
```shell
ollama run gemma3:4b
```
For developers building applications, using a local inference server like LM Studio or llama.cpp allows you to expose an OpenAI-compatible API endpoint, making it easy to swap Gemma 3 4B IT into existing workflows.
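A minimal sketch of that swap, assuming Ollama's OpenAI-compatible endpoint on its default port (LM Studio defaults to port 1234 instead); the model tag is the Ollama one used above:

```python
def chat_completion_payload(prompt, model="gemma3:4b"):
    """Request body accepted by any OpenAI-compatible /v1/chat/completions."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

if __name__ == "__main__":
    # Requires `pip install openai` and a running local server.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    resp = client.chat.completions.create(
        **chat_completion_payload("Summarize retrieval-augmented generation in one sentence.")
    )
    print(resp.choices[0].message.content)
```

Because the request shape is the standard OpenAI one, existing tooling only needs the `base_url` (and a dummy API key) changed to target the local model.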
When evaluating Gemma 3 4B IT, it is most often compared against Llama 3.2 3B and Phi-4 (14B or mini variants).
Gemma 3 4B IT represents the current state of the art among 4B-parameter local models in 2025. It is the first model of this size that truly feels "unconstrained," offering the context window and multimodal capabilities previously limited to massive cloud-based models. For developers building edge-AI applications, it is currently the model to beat.