Google

Gemma 4 12B

Google's first unified, encoder-free Gemma model: a 12B dense multimodal model that projects raw image patches and audio waveforms straight into the language model instead of using separate encoders. It accepts text, image, audio, and video input, supports a 256K-token context, ships open-weight under Apache 2.0, and is small enough to run locally on a 16GB GPU. It was trained on data through January 2025 and scores 77.2% on MMLU-Pro, 78.8% on GPQA Diamond, 77.5% on AIME 2026 without tools, and 69.1% on MMMU Pro.

12B params · Dense · 256K ctx · Multimodal


Our Take

Best for: Strongest at graduate-level reasoning (GPQA) in its size class

A solid 12B-parameter dense model from Google. It pulls ahead on graduate-level reasoning (GPQA, 75/100), so reach for it when that is the dimension that matters. It is newly released, so production readiness is still being shaken out.

Run this on: Corsair AI Workstation 300 (Ryzen AI Max 385). Cheapest card in our directory with comfortable headroom (48 GB) for this model at Q4 (~32.0 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters: 12B
Architecture: Dense
Context Length: 256K tokens
Modality: Multimodal
Training Cutoff: January 2025
Provider: Google
Download Size: 71.8 GB

Community

Monthly Downloads: 2.2M
Likes: 1.2K
Last Updated: 22 days ago

Quick Start

Run with Ollama

Copy and paste this command to start running the model locally.

ollama run gemma4:12b
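
Once the model has been pulled, Ollama also serves a local HTTP API (port 11434 by default), which is convenient for scripting. The following is a minimal sketch, assuming the Ollama server is running on the same machine and the `gemma4:12b` tag from the command above is available. The helper names `build_generate_request` and `generate` are illustrative, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_generate_request(prompt: str, model: str = "gemma4:12b") -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4:12b") -> str:
    """Send the prompt and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_generate_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Summarize the Apache 2.0 license in one sentence.")` should return the reply as a plain string once the server is up.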

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

GPQA: 75.3
HLE: 14.8
AA Intelligence Index: 22.0

38.2

18.2

73.5

55.3

MBA Open Score: 60.9 (tier B)

Benchmark (40% weight): 42.5
Popularity (25% weight): 68.7
Efficiency (20% weight): 60.6
Versatility (15% weight): 97.5
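
The headline score appears to be a plain weighted sum of the four components. The page does not state the formula, so treat that as an assumption; it does reproduce the listed 60.9. A short check:

```python
# Component scores and weights as listed in the breakdown above.
COMPONENTS = {
    "benchmark": (42.5, 0.40),
    "popularity": (68.7, 0.25),
    "efficiency": (60.6, 0.20),
    "versatility": (97.5, 0.15),
}


def weighted_score(components: dict[str, tuple[float, float]]) -> float:
    """Return the weighted sum of (score, weight) pairs."""
    return sum(score * weight for score, weight in components.values())


total = weighted_score(COMPONENTS)
print(f"{total:.1f}")  # 60.9
```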

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	29.5 GB	Low	Aggressive quantization: smallest size, noticeable quality loss
Q4_K_M (recommended)	32.0 GB	Good	Best balance of size and quality for most use cases
Q5_K_M	33.2 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	34.7 GB	Excellent	Near-lossless quality with manageable size
Q8_0	37.7 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	49.1 GB	Full	Full 16-bit floating point: maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Showing the first 25 of 102 devices.

Device	Vendor	Tier	Speed	VRAM Required
NVIDIA A100 SXM4 80GB	NVIDIA	S	51.2 tok/s	32.0 GB
NVIDIA H100 SXM5 80GB	NVIDIA	S	84.2 tok/s	32.0 GB
Google Cloud TPU v5p	Google	S	69.5 tok/s	32.0 GB
Intel Gaudi 2 AI Accelerator	Intel	S	61.6 tok/s	32.0 GB
Intel Gaudi 3 AI Accelerator	Intel	S	93.0 tok/s	32.0 GB
NVIDIA H200 SXM 141GB	NVIDIA	S	120.6 tok/s	32.0 GB
AMD Instinct MI300X	AMD	S	133.2 tok/s	32.0 GB
Google TPU v7 (Ironwood)	Google	S	185.4 tok/s	32.0 GB
NVIDIA B200 GPU	NVIDIA	S	201.0 tok/s	32.0 GB
AMD Instinct MI325X	AMD	S	150.8 tok/s	32.0 GB
AMD Instinct MI355X	AMD	A	201.0 tok/s	32.0 GB
ASUS ExpertCenter Pro ET900N G3	ASUS	A	178.4 tok/s	32.0 GB
Dell Pro Max with GB300	Dell	A	178.4 tok/s	32.0 GB
HP ZGX Fury AI Station	HP	A	178.4 tok/s	32.0 GB
MSI XpertStation WS300	MSI	A	178.4 tok/s	32.0 GB
SuperMicro Super AI Station	SuperMicro	A	178.4 tok/s	32.0 GB
Gigabyte W775-V10-L01	Gigabyte	A	178.4 tok/s	32.0 GB
NVIDIA RTX 6000 Ada Generation	NVIDIA	A	24.1 tok/s	32.0 GB
Origin PC L-CLASS v2	Origin PC	A	24.1 tok/s	32.0 GB
NVIDIA L40S	NVIDIA	A	21.7 tok/s	32.0 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	A	20.1 tok/s	32.0 GB
Apple Mac Studio (M2 Ultra, 2023)	Apple	B	20.1 tok/s	32.0 GB
Apple Mac Studio (M1 Max, 2022)	Apple	B	10.1 tok/s	32.0 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	B	20.6 tok/s	32.0 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	B	15.4 tok/s	32.0 GB

Run Locally vs API

Energy cost on Apple M4 (~3.0 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	Gemma 4 12B on Apple M4 · ~3.0 tok/s · 25W	$0.276
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.
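
The blended API figures follow directly from the 70/30 split, and the local figure can be sanity-checked from the stated speed and power draw. The sketch below reproduces both. The electricity rate is not given on the page, so the last line only back-derives the rate implied by the $0.276 figure (roughly $0.12 per kWh); the function names are illustrative.

```python
def blended_price(input_per_m: float, output_per_m: float, input_share: float = 0.70) -> float:
    """Blend input/output prices (USD per 1M tokens) by token share."""
    return input_share * input_per_m + (1.0 - input_share) * output_per_m


def energy_kwh_per_million_tokens(tokens_per_s: float, watts: float) -> float:
    """Energy in kWh to generate 1M tokens at a steady speed and power draw."""
    seconds = 1_000_000 / tokens_per_s
    return watts * seconds / 3_600_000  # watt-seconds -> kWh


print(round(blended_price(5.00, 30.00), 2))  # 12.5  (GPT-5.5 row)
print(round(blended_price(1.25, 2.50), 3))   # 1.625 (Grok 4.3 row, listed as $1.63)

kwh = energy_kwh_per_million_tokens(tokens_per_s=3.0, watts=25)
print(round(kwh, 2))          # 2.31
print(round(0.276 / kwh, 3))  # implied electricity price in USD per kWh
```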

Hardware amortisation not included. Run the full ROI calculator for payback math.


Rent in the Cloud

Cheapest current cloud rentals with at least 32 GB VRAM, refreshed hourly.

Option	Provider · Tier · VRAM	Cost / GPU-hour
AMD Instinct MI300X	RunPod · Community · 192 GB VRAM	$0.50
NVIDIA H200 NVL	RunPod · Community · 141 GB VRAM	$0.50
NVIDIA A100 80GB PCIe	Vast.ai · Spot · 80 GB VRAM	$0.62
NVIDIA A100 80GB PCIe	Vast.ai · On-Demand · 80 GB VRAM	$0.62
NVIDIA L40	RunPod · Community · 48 GB VRAM	$0.69

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
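
To compare a rental against the per-token prices elsewhere on this page, an hourly rate can be converted to a cost per 1M tokens. This is a rough sketch that assumes a single generation stream running continuously at the listed speed; batching would lower the figure and idle time would raise it.

```python
def rental_cost_per_million_tokens(usd_per_gpu_hour: float, tokens_per_s: float) -> float:
    """USD to generate 1M tokens on a rented GPU running one stream flat out."""
    tokens_per_hour = tokens_per_s * 3600
    return usd_per_gpu_hour / tokens_per_hour * 1_000_000


# MI300X: $0.50 per hour (rental table) and 133.2 tok/s (hardware table).
cost = rental_cost_per_million_tokens(0.50, 133.2)
print(f"${cost:.2f} per 1M tokens")  # $1.04 per 1M tokens
```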


About This Model

Overview

Gemma 4 12B is Google DeepMind’s first unified, encoder-free multimodal model for local deployment. At 12 billion parameters, it sits between the edge‑optimized E4B and the 26B MoE variant, offering a dense architecture that fits into a 16GB GPU while delivering benchmark performance near the larger model. Trained on data through January 2025 and released under Apache 2.0, it is designed for developers who need both vision and language capabilities on consumer hardware — no cloud API calls required.

The model accepts text, image, audio, and video input and supports a 256,000-token context window. It scores 77.2% on MMLU-Pro, 78.8% on GPQA Diamond, 77.5% on AIME 2026 (without tools), and 69.1% on MMMU Pro. Those numbers put it ahead of most 7B–13B dense models and make it competitive with much larger MoE alternatives. For practitioners running 12B-class models locally in 2026, Gemma 4 12B is a strong candidate for workloads that need both reasoning and multimodal understanding.

Architecture & Technical Details

Gemma 4 12B uses a dense transformer backbone — no mixture‑of‑experts, no separate vision or audio encoders. Raw image patches and audio waveforms are projected directly into the language model’s embedding space, a design that reduces multimodal latency by bypassing heavy multi‑stage encoder pipelines. This encoder‑free approach also simplifies deployment: you load one model, not two, and the memory footprint stays predictable.

Key specs:

Parameters: 12B (all active at inference)
Context length: 256,000 tokens
Modality: text, image, audio, video input → text output
License: Apache 2.0
Drafters: Optional Multi-Token Prediction (MTP) drafters are available to reduce inference latency; using one requires loading a separate drafter model.

Because it is dense, every forward pass uses all 12B parameters, so per-token compute scales with the full model size. An MoE model, by contrast, activates only a fraction of its parameters per token, although all of its weights must still be held in memory. For a given quantization level, you can expect consistent memory consumption and throughput, with no variance from routing decisions.

The 256K context enables long‑form reasoning, code analysis over repositories, and processing of lengthy audio transcripts or video segments without chunking. Multilingual support covers over 140 languages, and the model natively accepts the system role for structured conversations.

Capabilities & Use Cases

Gemma 4 12B was trained for chat, code, vision, reasoning, function‑calling, multilingual, math, and instruction‑following. It is not a specialist model — it’s a generalist that performs well across these axes without needing separate adapters.

Reasoning & Math

With 77.5% on AIME 2026 (no tools) and 78.8% on GPQA Diamond, the model handles multi-step reasoning and competition-level math problems. This is useful for agentic workflows where the model must break tasks into sub-goals, call functions, and verify intermediate results.

Coding

Benchmarks indicate strong coding capability comparable to the 26B MoE variant. Expect solid performance on common languages (Python, JavaScript, C++, Go) and reasonable support for niche ones. The 256K context allows feeding entire codebases or diff histories for refactoring and review.

Vision & Audio

Because there are no external encoders, the model processes images and audio natively. This means lower latency for multimodal tasks like diagram analysis, OCR on screenshots, or transcribing and reasoning over meeting recordings. Video input is supported as a sequence of frames (no dedicated video encoder) — practical for short clips or keyframe analysis.
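
Since video arrives as a sequence of frames, a long clip usually has to be thinned to a frame budget before it is sent to the model. The page does not say how frames should be chosen, so the following is just one simple option: evenly spaced sampling. The function name and the budget of 8 frames are hypothetical.

```python
def sample_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Pick up to max_frames indices spread evenly across a clip."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    # Take the middle frame of each equal-width segment.
    return [int(step * i + step / 2) for i in range(max_frames)]


# A 10-second clip at 30 fps reduced to 8 frames.
print(sample_frame_indices(300, 8))  # [18, 56, 93, 131, 168, 206, 243, 281]
```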

Function‑Calling & Agents

Native function‑calling support enables autonomous workflows. Developers can define tools and let the model decide when to invoke them — useful for local AI agents that query databases, run shell commands, or interact with APIs without round‑tripping to a cloud endpoint.
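
On the application side, function calling comes down to parsing the call the model emits and routing it to local code. The sketch below assumes the call arrives as JSON with `name` and `arguments` fields; that shape, the tool registry, and the example tools are all hypothetical, and the exact format depends on the runtime and chat template in use.

```python
import json
from typing import Any, Callable

# Hypothetical tool registry: tool name -> Python callable.
TOOLS: dict[str, Callable[..., Any]] = {
    "add": lambda a, b: a + b,
    "word_count": lambda text: len(text.split()),
}


def dispatch_tool_call(raw: str) -> Any:
    """Parse a JSON tool call of the assumed shape and run the matching tool."""
    call = json.loads(raw)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Stand-in for text the model might emit when it decides to call a tool.
model_output = '{"name": "add", "arguments": {"a": 2, "b": 3}}'
print(dispatch_tool_call(model_output))  # 5
```

Rejecting unknown tool names up front matters for local agents in particular, since the tools may run shell commands or touch databases.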

Multilingual

Trained on data in over 140 languages, with strong performance on European and East Asian languages. Instruction‑following quality remains consistent across languages, making it suitable for deploying a single model globally.

Running Gemma 4 12B Locally

The model is designed for a 16GB GPU, but real‑world VRAM requirements depend on quantization and context length.

Quantization	VRAM (approx.)	Typical hardware	Expected tokens/s (16GB GPU)
Q4_K_M (recommended)	~8 GB	RTX 3090, RTX 4070 Ti, M4 Pro (24GB unified)	20–40 tok/s
Q5_K_M	~10 GB	RTX 4090, A6000, M4 Max (48GB)	15–25 tok/s
Q8_0	~13 GB	RTX 4090, dual GPU setups	10–15 tok/s
FP16 (full)	~24 GB	A100, dual RTX 4090	5–10 tok/s
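
The VRAM column above can be roughly cross-checked from the parameter count, since weight memory is about parameters × bits per weight ÷ 8. In the sketch below, the bits-per-weight values for the GGUF formats are approximate, commonly quoted averages and are assumptions, not figures published for this model.

```python
# Approximate bits per weight. The GGUF values are rough averages and are
# assumptions here, not numbers published for Gemma 4 12B.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.69,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: {weight_gb(12e9, bits):.1f} GB")
```

The results come out at or slightly below the table's approximate figures, which is what you would expect if the table also allows for runtime buffers. The KV cache is extra and depends on context length.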

Minimum hardware: A GPU with 8 GB VRAM will run Q4_K_M at small contexts (<8K tokens). For the full 256K context, you need at least 16 GB VRAM, because the KV cache grows linearly with context length. The model was tested on a single RTX 4090 (24 GB), an M4 Max (48 GB unified), and 16 GB systems (RTX 4060 Ti 16 GB).
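
The linear growth of the KV cache can be made concrete with the standard size formula for full attention. The layer count, KV-head count, and head dimension below are placeholders, not Gemma 4 12B's real configuration, so only the scaling behaviour is meaningful. Architectures that use sliding-window layers or a quantized cache need less than this formula suggests.

```python
def kv_cache_gb(
    context_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,
) -> float:
    """KV cache size in GB for one sequence under full attention."""
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token_bytes / 1e9


# Placeholder architecture values, NOT Gemma 4 12B's real configuration.
HYPOTHETICAL = dict(num_layers=40, num_kv_heads=8, head_dim=128)

small = kv_cache_gb(8_000, **HYPOTHETICAL)
large = kv_cache_gb(256_000, **HYPOTHETICAL)
print(large / small)  # 32.0: 32x the context means 32x the cache
```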

Recommended setup for most users:

GPU with 16–24 GB VRAM
Q4_K_M quantization (best quality per GB)
Ollama (ollama run gemma4) or Hugging Face Transformers with load_in_4bit=True
Optionally load the MTP drafter for up to 2× throughput improvement (requires ~2 GB extra VRAM)

Performance notes:

Without drafter, expect 20–30 tok/s on a 16 GB GPU (4‑bit).
With MTP drafter, throughput can reach 40+ tok/s for text‑only prompts.
Multimodal inputs (images/audio) increase latency by 100–300 ms per token due to projection.
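
One way to see where a figure like "up to 2×" can come from is the usual back-of-the-envelope estimate for draft-and-verify decoding. This is a generic model, not a description of how Google's MTP drafter works internally, and the acceptance rate, draft length, and drafter cost below are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per verification pass of the main model.

    Assumes each drafted token is accepted independently with the same
    probability, the usual simplification for speculative decoding.
    """
    a = acceptance_rate
    if a >= 1.0:
        return float(draft_len + 1)
    return (1 - a ** (draft_len + 1)) / (1 - a)


def estimated_speedup(acceptance_rate: float, draft_len: int, draft_cost: float) -> float:
    """Speedup over plain decoding; draft_cost is drafter time per token
    relative to one main-model pass."""
    tokens = expected_tokens_per_step(acceptance_rate, draft_len)
    return tokens / (1 + draft_len * draft_cost)


# Hypothetical: 60% acceptance, 3 drafted tokens, drafter at 5% of main-model cost.
print(round(estimated_speedup(0.6, 3, 0.05), 2))
```

With these made-up inputs the estimate lands a little under 2×, in line with the ceiling quoted above. Real gains depend on how often the drafter's guesses are accepted for a given workload.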

For detailed benchmarking, see the Ollama model page (9.6 GB for the gemma4:latest tag) or Google’s developer guide.

How It Compares

vs. Llama 3.1 8B (Meta)

Size: 8B vs 12B. Gemma has 50% more parameters, which shows in reasoning and math benchmarks.
Multimodality: Llama 3.1 8B is text-only. Gemma 4 12B handles images, audio, and video natively.
Context: 128K vs 256K.
License: Llama 3.1 Community License (Llama) vs Apache 2.0 (Gemma).
Verdict: Choose Gemma when you need vision or longer context; Llama 3.1 8B is lighter and faster on constrained hardware (6 GB VRAM for Q4).

vs. Qwen2.5 14B (Alibaba)

Size: 14B vs 12B — Qwen is larger, often better on coding and multilingual tasks.
Multimodality: Qwen2.5 14B is text‑only (vision variants exist but are separate). Gemma’s unified encoder‑free approach is more convenient.
Context: 32K by default (up to 128K with YaRN scaling) vs 256K.
License: Both Apache 2.0.
Verdict: Qwen2.5 14B has an edge in pure language tasks and fits a 16 GB GPU (Q4), but lacks native audio and vision. Gemma is the better generalist for multimodal agentic workflows.

vs. Gemma 4 26B MoE (Google)

Size: 26B total (9B active) vs 12B dense.
Performance: Gemma 4 12B scores within a few points of the 26B on many benchmarks while using half the VRAM.
Latency: The dense 12B has more predictable throughput; the MoE 26B can be faster per token because fewer parameters are active, but its throughput varies with expert routing.
Verdict: The 12B is the practical choice for single‑GPU local inference. The 26B only makes sense if you have 48+ GB VRAM and need the extra few points on reasoning.
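
The trade-off between the two variants can be sketched with two rules of thumb: weight memory tracks total parameters, while per-token compute tracks active parameters (about 2 FLOPs per active parameter for a forward pass). The parameter counts are the ones quoted above; the rules themselves are approximations that ignore attention cost and runtime overhead.

```python
def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """All weights must be resident, so memory tracks total parameters."""
    return total_params * bits_per_weight / 8 / 1e9


def gflops_per_token(active_params: float) -> float:
    """Rough forward-pass cost: about 2 FLOPs per active parameter per token."""
    return 2 * active_params / 1e9


models = {
    "Gemma 4 12B (dense)": {"total": 12e9, "active": 12e9},
    "Gemma 4 26B (MoE)": {"total": 26e9, "active": 9e9},
}

for name, m in models.items():
    print(
        name,
        f"{weight_gb(m['total'], 16):.0f} GB at 16-bit,",
        f"{gflops_per_token(m['active']):.0f} GFLOPs/token",
    )
```

Under these assumptions the MoE needs roughly twice the memory while doing less arithmetic per token, which matches the "half the VRAM" and "faster per token" points above.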

In short, Gemma 4 12B occupies a unique spot: dense, local‑friendly, multimodal, and open. It is not the fastest at pure text generation, but for developers who need to run models that see, hear, and reason on a single consumer GPU, it is currently one of the strongest options available.

