
Google's smallest Gemma 4 edge model with an 'Effective 2B' parameter footprint using Per-Layer Embeddings. Full multimodal support including native audio input. 128K context. Optimized for mobile and IoT with minimal RAM and battery usage.
Copy and paste this command to start running the model locally.

```shell
ollama run gemma4:e2b
```
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 3.3 GB | Low | Aggressive quantization: smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 3.7 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 3.9 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | 4.1 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 4.6 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 6.5 GB | Full | Full 16-bit floating point: maximum quality, largest size |
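As a rough sanity check on the table above, raw weight size scales with bits per weight. The sketch below uses approximate bits-per-weight figures commonly quoted for GGUF quant formats (assumed community values, not official specs). Note that the table's VRAM numbers sit well above raw weight size because they also include KV cache and runtime overhead.

```python
# Rough GGUF weight-size estimate: params * bits-per-weight / 8 bytes.
# Bits-per-weight values are approximate community figures, not official specs.
BPW = {"Q2_K": 2.6, "Q4_K_M": 4.85, "Q5_K_M": 5.7,
       "Q6_K": 6.6, "Q8_0": 8.5, "FP16": 16.0}

def weight_size_gb(params: float, fmt: str) -> float:
    """Approximate in-VRAM size of the raw weights in GB (excludes KV cache)."""
    return params * BPW[fmt] / 8 / 1e9

for fmt in BPW:
    print(f"{fmt:7s} ~{weight_size_gb(2e9, fmt):.2f} GB of weights")
```

For an effective-2B model, FP16 weights alone come out to about 4 GB, which is consistent with the prose below; the gap up to the table's 6.5 GB figure is runtime overhead.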
See which devices can run this model and at what quality level.
| Device | Grade | Throughput | VRAM |
|---|---|---|---|
| NVIDIA GeForce RTX 4060 | AA | 59.1 tok/s | 3.7 GB |
| NVIDIA GeForce RTX 4070 | AA | 109.4 tok/s | 3.7 GB |
| NVIDIA GeForce RTX 5070 | AA | 145.9 tok/s | 3.7 GB |
| Intel Arc B580 | AA | 99.0 tok/s | 3.7 GB |
| Intel Arc A770 16GB | AA | 121.6 tok/s | 3.7 GB |
| Google Cloud TPU v5e | AA | 177.8 tok/s | 3.7 GB |
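To translate the throughput figures above into perceived latency, divide the expected reply length by the sustained decode rate. A minimal helper (the token counts and rates below are illustrative, not benchmarks from this page):

```python
def generation_time_s(reply_tokens: int, tok_per_s: float) -> float:
    """Seconds to stream a reply of `reply_tokens` at a sustained decode rate."""
    return reply_tokens / tok_per_s

# A 300-token reply at ~60 tok/s (entry-level GPU) vs ~180 tok/s (high end):
print(f"{generation_time_s(300, 60):.1f} s vs {generation_time_s(300, 180):.1f} s")
```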
The Gemma 4 E2B IT is Google’s most efficient edge-optimized model to date, designed specifically for deployment on mobile devices, IoT hardware, and low-power consumer electronics. As an "Effective 2B" parameter model, it utilizes Per-Layer Embeddings to achieve the performance typically expected of larger architectures while maintaining a footprint small enough for real-time interaction on commodity hardware.
Unlike previous iterations of small language models (SLMs) that were text-only, Gemma 4 E2B IT is natively multimodal. It supports vision and audio input alongside its 128,000-token context window, making it a primary candidate for local AI agents that need to process long documents, analyze visual data, or interact via voice without a round-trip to a cloud server. It competes directly with other lightweight models like Llama 3.2 1B/3B and Phi-3.5 Mini, positioning itself as a high-performance alternative for developers who prioritize low latency and minimal battery drain.
The "E2B" designation refers to "Effective 2B" parameters. This architecture uses a dense transformer backbone but optimizes the parameter footprint through Per-Layer Embeddings. This allows the model to retain a high degree of reasoning capability and multilingual fluency while keeping the weights compact.
The 128K context window is a significant technical milestone for a 2B parameter model. Traditionally, long context windows in small models suffer from severe "lost in the middle" degradation or massive KV (Key-Value) cache overhead. Gemma 4 E2B IT, however, is optimized for high retrieval accuracy across its entire context. For practitioners, this means you can load several technical manuals or a large codebase into the local prompt without the model losing track of the initial instructions.
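The KV cache overhead mentioned above grows linearly with context length. The sketch below estimates it for a hypothetical set of architecture dimensions (the layer, head, and dimension counts are placeholders, not published Gemma 4 specs), which makes clear why quantizing the weights to free up VRAM for the cache matters at 128K:

```python
def kv_cache_gb(ctx_tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GB: one K and one V tensor per layer,
    stored at fp16 (2 bytes per element) by default."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_tokens / 1e9

# Illustrative dims only -- NOT published Gemma 4 architecture figures:
print(f"~{kv_cache_gb(131_072, layers=24, kv_heads=4, head_dim=128):.1f} GB "
      "of KV cache at a full 128K context")
```

Halving the cache precision (`bytes_per_elem=1` for an 8-bit cache) halves this figure, which is one reason small-footprint deployments often quantize the cache as well as the weights.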
This is a native multimodal model. Rather than using a separate vision encoder or audio-to-text bridge, Gemma 4 E2B IT processes these modalities within its unified architecture. This reduces the VRAM overhead usually required to run multiple specialized models and ensures that the "vision" and "audio" capabilities benefit from the same instruction-tuning as the text-based logic.
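Because images go through the same chat interface as text, a request to a local Ollama server can attach base64-encoded image data directly to a message. The sketch below only builds the JSON body for Ollama's `/api/chat` endpoint; actually sending it requires a running Ollama server, and the `gemma4:e2b` tag is the one used on this page:

```python
import base64
import json

def build_vision_request(model: str, prompt: str, image_bytes: bytes) -> str:
    """JSON body for Ollama's /api/chat with one image attached to the user turn."""
    payload = {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            # Ollama expects base64-encoded image data in the `images` list
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
        "stream": False,
    }
    return json.dumps(payload)

body = build_vision_request("gemma4:e2b", "Describe this image.", b"\x89PNG...")
```

POSTing `body` to `http://localhost:11434/api/chat` on a machine running Ollama would return the model's description of the image.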
Gemma 4 E2B IT is an instruction-tuned (IT) model, meaning it is optimized for chat interfaces and direct command execution. Its multimodal nature and 128K context window make it suitable for a range of local workloads, such as long-document analysis, on-device vision tasks, and voice interaction.
The primary draw of this model is its accessibility. You do not need a workstation-class GPU to run Gemma 4 E2B IT; it is designed to be performant on integrated graphics and mobile chips.
The Gemma 4 E2B IT VRAM requirements are exceptionally low. At FP16 precision, the model weights alone take up approximately 4 GB of VRAM. However, most practitioners will run it at 4-bit or 8-bit quantization to leave room for the KV cache when utilizing the full 128K context.
For most users, the best quantization for Gemma 4 E2B IT is Q4_K_M or Q6_K.
When you run Gemma 4 E2B IT locally, you can expect high throughput even on modest hardware.
To get started immediately, Ollama is the recommended tool. Running `ollama run gemma4:e2b` will pull the model and configure the environment for your specific hardware automatically.
When evaluating a local AI model with 2B parameters in 2025, the comparison usually falls between Gemma 4, Llama 3.2, and Phi-3.5.
Llama 3.2 3B is a formidable competitor with strong community support. However, Gemma 4 E2B IT generally edges it out in multimodal tasks, particularly native audio processing. While Llama 3.2 is excellent for text-based instruction following, both models advertise a 128K context window; at the edge, Gemma's Per-Layer Embeddings give it a memory-efficiency advantage when that context is actually filled.
Microsoft’s Phi-3.5 Mini (3.8B) is known for "punching above its weight" in reasoning. While Phi-3.5 might outperform Gemma in complex logic or mathematical coding tasks, Gemma 4 E2B IT is the superior choice for mobile and IoT integration: its "Effective 2B" footprint is roughly half the size of Phi-3.5 Mini, making it the better fit for devices with strictly limited RAM (4 GB or less).