Google

Gemma 4 E4B IT

Google's efficient edge model with an 'Effective 4B' parameter footprint using Per-Layer Embeddings (PLE). Supports text, image, video, and native audio input. 128K context. Designed for on-device deployment on phones, laptops, and IoT devices with near-zero latency.

4B paramsDense128K ctxMultimodal

View on Hugging Face

Run with Ollama Official Page

Our Take

Best for: Strongest at broad knowledge (MMLU-Pro) in its size class

A strong 4B-parameter dense language model from Google. Pulls ahead on broad knowledge (MMLU-Pro) (69/100), so reach for it when that's the dimension that matters.

Run this onAMD Radeon RX 7700 XTCheapest card in our directory with comfortable headroom (12 GB) for this model at Q4 (~6.9 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Multilingual

Instruction Following

Model Specifications

Parameters4B

ArchitectureDense

Context Length128K tokens

ModalityMultimodal

ProviderGoogle

Download Size48.1 GB

Community

Monthly Downloads6.3M

Likes1.1K

Last Updated4 days ago

Quick Start

Run with Ollama

Copy and paste this command to start running the model locally.

ollama run gemma4:e4b

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Gemma Terms of UseView Full License

Performance & Scoring

Benchmarks

GPQA

58.6

MMLU-PRO

69.4

AIME 2026

42.5

Overall Score

70.7AA

Benchmark40%

56.8

Popularity25%

75.4

Efficiency20%

86.9

Versatility15%

78.0

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	6.1 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	6.9 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	7.3 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	7.8 GB	Excellent	Near-lossless quality with manageable size
Q8_0	8.8 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	12.6 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


AMD Radeon RX 7700 XTAMD	SS	50.3 tok/s	6.9 GB
Intel Arc B580Intel	SS	53.1 tok/s	6.9 GB
NVIDIA GeForce RTX 4070NVIDIA	SS	58.7 tok/s	6.9 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	SS	58.7 tok/s	6.9 GB
NVIDIA GeForce RTX 5070NVIDIA	SS	78.2 tok/s	6.9 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	59.6 tok/s	6.9 GB
AMD Radeon RX 7800 XTAMD	SS	72.6 tok/s	6.9 GB
AMD Radeon RX 9070AMD	SS	74.5 tok/s	6.9 GB
AMD Radeon RX 9070 XTAMD	SS	74.5 tok/s	6.9 GB
Google Cloud TPU v5eGoogle	SS	95.3 tok/s	6.9 GB
Intel Arc A770 16GBIntel	SS	65.2 tok/s	6.9 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	SS	111.7 tok/s	6.9 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	SS	78.2 tok/s	6.9 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	SS	85.7 tok/s	6.9 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	SS	52.1 tok/s	6.9 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	SS	104.3 tok/s	6.9 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	SS	111.7 tok/s	6.9 GB
AMD Radeon RX 7900 XTAMD	SS	93.1 tok/s	6.9 GB
AMD Radeon RX 7900 XTXAMD	AA	111.7 tok/s	6.9 GB
NVIDIA GeForce RTX 3090NVIDIA	AA	108.9 tok/s	6.9 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	AA	117.3 tok/s	6.9 GB
Google Cloud TPU v6e (Trillium)Google	AA	190.9 tok/s	6.9 GB
NVIDIA GeForce RTX 5090 Founders EditionNVIDIA	AA	208.6 tok/s	6.9 GB
Origin PC M-CLASS v2Origin PC	AA	208.6 tok/s	6.9 GB
NVIDIA GeForce RTX 4060 Ti 16GBNVIDIA	AA	33.5 tok/s	6.9 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~34 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Gemma 4 E4B IT on AMD Radeon RX 7600 8GB · ~34 tok/s · 165W	$0.164
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 7 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM	$0.04
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM	$0.05
NVIDIA GeForce RTX 3090Vast.ai · Spot · 24 GB VRAM	$0.05
NVIDIA GeForce RTX 3090Vast.ai · On-Demand · 24 GB VRAM	$0.07
NVIDIA GeForce RTX 5070Vast.ai · Spot · 12 GB VRAM	$0.08

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Gemma 4 E4B IT is Google’s high-efficiency "Effective 4B" multimodal model, designed specifically for edge deployment. Unlike previous generations that relied on standard dense scaling, this model utilizes Per-Layer Embeddings (PLE) to compress its footprint while maintaining the reasoning capabilities of much larger architectures. It is a native multimodal model, meaning it processes text, images, video, and audio through a unified architecture rather than relying on separate adapter modules.

For developers and engineers, Gemma 4 E4B IT represents a shift toward "near-zero latency" local AI. It competes directly with models like Phi-3.5 Mini and Llama 3.2 3B, offering a significantly larger 128,000-token context window and native audio processing that is often missing in the sub-8B parameter category. This makes it a primary candidate for local agents, real-time voice assistants, and on-device document analysis where data privacy and latency are the primary constraints.

Architecture & Technical Details

The "Effective 4B" designation refers to a dense architecture that optimizes parameter distribution through PLE. By using specialized embedding layers, the model reduces the memory overhead traditionally associated with high-vocabulary, multilingual models. This architectural choice directly impacts Gemma 4 E4B IT performance, allowing it to fit into the memory constraints of mobile devices and entry-level laptops without the massive perplexity hits usually seen in aggressive pruning.

The standout technical feature is the 128,000 token context length. For a 4B parameter model, this is an outlier. Most models in this weight class struggle with "needle-in-a-haystack" retrieval beyond 8K or 32K tokens. Gemma 4 E4B IT is engineered to maintain coherence over long-form inputs, making it viable for analyzing entire codebases or long PDF documents locally. As a dense model, every parameter is active during inference, providing consistent throughput and predictable Gemma 4 E4B IT VRAM requirements.

Capabilities & Use Cases

Gemma 4 E4B IT is an instruction-tuned (IT) model, meaning it is optimized for conversational dialogue and following complex system prompts. Its multimodality is native, which is critical for practitioners building integrated applications.

Multimodal Reasoning: Because it supports image, video, and native audio input, you can use it for real-time surveillance analysis, automated video captioning, or voice-controlled interfaces without an intermediate Speech-to-Text (STT) layer.
Local Coding Assistant: While it won't replace a 70B model for complex architectural design, Gemma 4 E4B IT for coding is highly effective for unit test generation, boilerplate expansion, and real-time autocomplete within an IDE. It understands Python, JavaScript, C++, and Go with high proficiency.
Multilingual Document Processing: The model excels at translating and summarizing content across major global languages. Its 128K context allows you to feed it multiple long-form documents for cross-referencing and synthesis.
Edge Reasoning: On Gemma 4 E4B IT reasoning benchmarks, the model punches above its weight class in logic and math, making it suitable for local RAG (Retrieval-Augmented Generation) where the model must decide which parts of a retrieved context are relevant to the query.

Running Gemma 4 E4B IT Locally

To run Gemma 4 E4B IT locally, your primary constraint will be VRAM. Because this is a 4B parameter model, it is exceptionally accessible for consumer-grade hardware.

Hardware Requirements & VRAM

The Gemma 4 E4B IT hardware requirements vary based on your choice of quantization. At 4-bit quantization, the model is small enough to run on almost any modern smartphone or laptop.

Minimum (Q4_K_M Quantization): ~3.5 GB VRAM. This allows the model to run on an NVIDIA RTX 3050 (4GB) or an entry-level MacBook Air with 8GB of Unified Memory.
Recommended (Q8_0 or FP16): 6 GB to 10 GB VRAM. This provides the highest precision and is easily handled by an RTX 3060 or 4060.
Best GPU for Gemma 4 E4B IT: For maximum throughput, an RTX 4090 or Apple M4 Max is overkill for the model size but will result in staggering token speeds, often exceeding 150+ tokens per second.

Performance and Quantization

When deciding on the best quantization for Gemma 4 E4B IT, the standard recommendation for most practitioners is Q4_K_M. This provides a 4.5-bit effective precision that retains nearly all the logic of the FP16 model while cutting the memory footprint by more than half.

Gemma 4 E4B IT tokens per second: On an Apple M2/M3 chip, expect 40-70 t/s. On a dedicated NVIDIA 40-series GPU, you will likely hit the limits of your inference engine before the hardware, often seeing 100+ t/s.
How to run 4B model on consumer GPU: The most efficient way to deploy this model is via Ollama. Running ollama run gemma4:4b will automatically pull the optimized manifest and handle the memory allocation for your specific GPU. For developers looking to integrate this into a Python stack, using llama-cpp-python or vLLM (if running on Linux/NVIDIA) is recommended for production-grade local serving.

How It Compares

Gemma 4 E4B IT enters a crowded field of "small" models. Here is how it stacks up against its closest rivals in the local AI model 4B parameters 2025 category:

Gemma 4 E4B IT vs. Phi-3.5 Mini (3.8B)

Microsoft’s Phi-3.5 Mini is a formidable competitor in logic and coding. However, Gemma 4 E4B IT generally offers superior multimodal capabilities, particularly in native audio processing. While Phi-3.5 is excellent for pure text reasoning, Gemma is the better choice for "eyes and ears" applications. Gemma’s 128K context also dwarfs the standard windows of many Phi implementations.

Gemma 4 E4B IT vs. Llama 3.2 3B

Llama 3.2 3B is highly optimized for mobile, but Gemma’s PLE architecture gives it a slight edge in multilingual tasks and long-context retention. Llama 3.2 is often easier to find fine-tunes for, but for out-of-the-box multimodal performance, Gemma 4 E4B IT is more versatile.

Tradeoffs

Pros: Massive 128K context; native audio/video support; extremely low VRAM footprint; high-speed inference on consumer hardware.
Cons: The Gemma Terms of Use are more restrictive than Apache 2.0 or MIT licenses; 4B parameters may still struggle with highly nuanced creative writing compared to 8B+ models.

For practitioners looking to build a local AI model 4B parameters setup in 2025, Gemma 4 E4B IT is currently the benchmark for multimodal efficiency. It bridges the gap between "toy" models and production-ready local intelligence.

Related Models

Google

Explore the Provider

See all Google models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Google model we track.

Open Google

Explore the Family

See every Gemma release

The full Gemma family leaderboard with sizes, benchmark scores, and a release timeline.

Open Gemma

Find the Best Hardware for This Model

Use our hardware calculator to find the optimal device for running this model.

Google

Gemma 4 E4B IT

4B paramsDense128K ctxMultimodal

View on Hugging Face

Run with Ollama Official Page

Our Take

Best for: Strongest at broad knowledge (MMLU-Pro) in its size class

A strong 4B-parameter dense language model from Google. Pulls ahead on broad knowledge (MMLU-Pro) (69/100), so reach for it when that's the dimension that matters.

Run this onAMD Radeon RX 7700 XTCheapest card in our directory with comfortable headroom (12 GB) for this model at Q4 (~6.9 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Multilingual

Instruction Following

Model Specifications

Parameters4B

ArchitectureDense

Context Length128K tokens

ModalityMultimodal

ProviderGoogle

Download Size48.1 GB

Community

Monthly Downloads6.3M

Likes1.1K

Last Updated4 days ago

Quick Start

Run with Ollama

Copy and paste this command to start running the model locally.

ollama run gemma4:e4b

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Gemma Terms of UseView Full License

Performance & Scoring

Benchmarks

GPQA

58.6

MMLU-PRO

69.4

AIME 2026

42.5

Overall Score

70.7AA

Benchmark40%

56.8

Popularity25%

75.4

Efficiency20%

86.9

Versatility15%

78.0

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	6.1 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	6.9 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	7.3 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	7.8 GB	Excellent	Near-lossless quality with manageable size
Q8_0	8.8 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	12.6 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


AMD Radeon RX 7700 XTAMD	SS	50.3 tok/s	6.9 GB
Intel Arc B580Intel	SS	53.1 tok/s	6.9 GB
NVIDIA GeForce RTX 4070NVIDIA	SS	58.7 tok/s	6.9 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	SS	58.7 tok/s	6.9 GB
NVIDIA GeForce RTX 5070NVIDIA	SS	78.2 tok/s	6.9 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	59.6 tok/s	6.9 GB
AMD Radeon RX 7800 XTAMD	SS	72.6 tok/s	6.9 GB
AMD Radeon RX 9070AMD	SS	74.5 tok/s	6.9 GB
AMD Radeon RX 9070 XTAMD	SS	74.5 tok/s	6.9 GB
Google Cloud TPU v5eGoogle	SS	95.3 tok/s	6.9 GB
Intel Arc A770 16GBIntel	SS	65.2 tok/s	6.9 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	SS	111.7 tok/s	6.9 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	SS	78.2 tok/s	6.9 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	SS	85.7 tok/s	6.9 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	SS	52.1 tok/s	6.9 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	SS	104.3 tok/s	6.9 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	SS	111.7 tok/s	6.9 GB
AMD Radeon RX 7900 XTAMD	SS	93.1 tok/s	6.9 GB
AMD Radeon RX 7900 XTXAMD	AA	111.7 tok/s	6.9 GB
NVIDIA GeForce RTX 3090NVIDIA	AA	108.9 tok/s	6.9 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	AA	117.3 tok/s	6.9 GB
Google Cloud TPU v6e (Trillium)Google	AA	190.9 tok/s	6.9 GB
NVIDIA GeForce RTX 5090 Founders EditionNVIDIA	AA	208.6 tok/s	6.9 GB
Origin PC M-CLASS v2Origin PC	AA	208.6 tok/s	6.9 GB
NVIDIA GeForce RTX 4060 Ti 16GBNVIDIA	AA	33.5 tok/s	6.9 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~34 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Gemma 4 E4B IT on AMD Radeon RX 7600 8GB · ~34 tok/s · 165W	$0.164
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 7 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM	$0.04
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM	$0.05
NVIDIA GeForce RTX 3090Vast.ai · Spot · 24 GB VRAM	$0.05
NVIDIA GeForce RTX 3090Vast.ai · On-Demand · 24 GB VRAM	$0.07
NVIDIA GeForce RTX 5070Vast.ai · Spot · 12 GB VRAM	$0.08

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Architecture & Technical Details

Capabilities & Use Cases

Multimodal Reasoning: Because it supports image, video, and native audio input, you can use it for real-time surveillance analysis, automated video captioning, or voice-controlled interfaces without an intermediate Speech-to-Text (STT) layer.
Local Coding Assistant: While it won't replace a 70B model for complex architectural design, Gemma 4 E4B IT for coding is highly effective for unit test generation, boilerplate expansion, and real-time autocomplete within an IDE. It understands Python, JavaScript, C++, and Go with high proficiency.
Multilingual Document Processing: The model excels at translating and summarizing content across major global languages. Its 128K context allows you to feed it multiple long-form documents for cross-referencing and synthesis.
Edge Reasoning: On Gemma 4 E4B IT reasoning benchmarks, the model punches above its weight class in logic and math, making it suitable for local RAG (Retrieval-Augmented Generation) where the model must decide which parts of a retrieved context are relevant to the query.

Running Gemma 4 E4B IT Locally

To run Gemma 4 E4B IT locally, your primary constraint will be VRAM. Because this is a 4B parameter model, it is exceptionally accessible for consumer-grade hardware.

Hardware Requirements & VRAM

The Gemma 4 E4B IT hardware requirements vary based on your choice of quantization. At 4-bit quantization, the model is small enough to run on almost any modern smartphone or laptop.

Minimum (Q4_K_M Quantization): ~3.5 GB VRAM. This allows the model to run on an NVIDIA RTX 3050 (4GB) or an entry-level MacBook Air with 8GB of Unified Memory.
Recommended (Q8_0 or FP16): 6 GB to 10 GB VRAM. This provides the highest precision and is easily handled by an RTX 3060 or 4060.
Best GPU for Gemma 4 E4B IT: For maximum throughput, an RTX 4090 or Apple M4 Max is overkill for the model size but will result in staggering token speeds, often exceeding 150+ tokens per second.

Performance and Quantization

Gemma 4 E4B IT tokens per second: On an Apple M2/M3 chip, expect 40-70 t/s. On a dedicated NVIDIA 40-series GPU, you will likely hit the limits of your inference engine before the hardware, often seeing 100+ t/s.
How to run 4B model on consumer GPU: The most efficient way to deploy this model is via Ollama. Running ollama run gemma4:4b will automatically pull the optimized manifest and handle the memory allocation for your specific GPU. For developers looking to integrate this into a Python stack, using llama-cpp-python or vLLM (if running on Linux/NVIDIA) is recommended for production-grade local serving.

How It Compares

Gemma 4 E4B IT enters a crowded field of "small" models. Here is how it stacks up against its closest rivals in the local AI model 4B parameters 2025 category:

Gemma 4 E4B IT vs. Phi-3.5 Mini (3.8B)

Gemma 4 E4B IT vs. Llama 3.2 3B

Tradeoffs

Pros: Massive 128K context; native audio/video support; extremely low VRAM footprint; high-speed inference on consumer hardware.
Cons: The Gemma Terms of Use are more restrictive than Apache 2.0 or MIT licenses; 4B parameters may still struggle with highly nuanced creative writing compared to 8B+ models.