Google's first unified, encoder-free Gemma model: a 12B dense multimodal model that projects raw image patches and audio waveforms straight into the language model instead of using separate encoders. It accepts text, image, audio, and video input, supports a 256K-token context, ships open-weight under Apache 2.0, and is small enough to run locally on a 16GB GPU. It was trained on data through January 2025 and scores 77.2% on MMLU-Pro, 78.8% on GPQA Diamond, 77.5% on AIME 2026 without tools, and 69.1% on MMMU Pro.
A solid 12B-parameter dense language model from Google. It pulls ahead on graduate-level reasoning (GPQA, 75/100), so reach for it when that is the dimension that matters. It is newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Copy and paste this command to start running the model locally.
ollama run gemma4:12b
Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 29.5 GB | Low |
| Q4_K_M (recommended) | 32.0 GB | Good |
| Q5_K_M | 33.2 GB | Very Good |
| Q6_K | 34.7 GB | Excellent |
| Q8_0 | 37.7 GB | Near Perfect |
| FP16 | 49.1 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Quality level | Speed | VRAM required |
|---|---|---|---|---|
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 51.2 tok/s | 32.0 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 84.2 tok/s | 32.0 GB |
| Google Cloud TPU v5p | Google | SS | 69.5 tok/s | 32.0 GB |
| | | SS | 61.6 tok/s | 32.0 GB |
| | | SS | 93.0 tok/s | 32.0 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 120.6 tok/s | 32.0 GB |
| | | SS | 133.2 tok/s | 32.0 GB |
| Google TPU v7 (Ironwood) | Google | SS | 185.4 tok/s | 32.0 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 201.0 tok/s | 32.0 GB |
| | | SS | 150.8 tok/s | 32.0 GB |
| | | AA | 201.0 tok/s | 32.0 GB |
| | | AA | 178.4 tok/s | 32.0 GB |
| | | AA | 178.4 tok/s | 32.0 GB |
| | | AA | 178.4 tok/s | 32.0 GB |
| | | AA | 178.4 tok/s | 32.0 GB |
| SuperMicro Super AI Station | SuperMicro | AA | 178.4 tok/s | 32.0 GB |
| Gigabyte W775-V10-L01 | Gigabyte | AA | 178.4 tok/s | 32.0 GB |
| | | AA | 24.1 tok/s | 32.0 GB |
| Origin PC L-CLASS v2 | Origin PC | AA | 24.1 tok/s | 32.0 GB |
| NVIDIA L40S | NVIDIA | AA | 21.7 tok/s | 32.0 GB |
| | | AA | 20.1 tok/s | 32.0 GB |
| | | BB | 20.1 tok/s | 32.0 GB |
| | | BB | 10.1 tok/s | 32.0 GB |
| | | BB | 20.6 tok/s | 32.0 GB |
| | | BB | 15.4 tok/s | 32.0 GB |
Energy cost on Apple M4 (~3.0 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Details | Cost per 1M tokens |
|---|---|---|
| Local (energy only) | Gemma 4 12B on Apple M4 · ~3.0 tok/s · 25W | $0.276 |
| GPT-5.5 | OpenAI · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking | Anthropic · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.5 Flash | Google · in $1.50 · out $9.00 | $3.75 |
| Grok 4.3 | xAI · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
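The arithmetic behind the cost table can be sketched in a few lines of Python. This is a minimal illustration, not the site's actual calculator: the function names are made up for the example, and the electricity rate of $0.12 per kWh is an assumption, since the page does not state the rate it uses.

```python
def blended_price(input_per_m: float, output_per_m: float, input_share: float = 0.7) -> float:
    """Blend input and output prices (USD per 1M tokens) by token share."""
    return input_share * input_per_m + (1.0 - input_share) * output_per_m


def energy_cost_per_million(tokens_per_s: float, watts: float, usd_per_kwh: float) -> float:
    """Electricity cost in USD to generate 1M tokens at a steady rate."""
    hours = 1_000_000 / tokens_per_s / 3600  # wall-clock time for 1M tokens
    kwh = hours * watts / 1000               # energy drawn over that time
    return kwh * usd_per_kwh


if __name__ == "__main__":
    # Prices taken from the table above; 70% input / 30% output blend.
    print(f"GPT-5.5 blended: ${blended_price(5.00, 30.00):.2f}")
    # 0.12 USD/kWh is an assumed rate, not a figure from the page.
    print(f"Local energy:    ${energy_cost_per_million(3.0, 25, 0.12):.3f}")
```

With the assumed rate, the local figure comes out at roughly $0.28 per 1M tokens, close to the $0.276 in the table. At 3.0 tok/s that is about 93 hours of generation, so the energy price says little about practicality.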
Cheapest current cloud rentals with at least 32 GB VRAM, refreshed hourly.
| Option | Provider · Tier · VRAM | Cost / GPU-hour |
|---|---|---|
| AMD Instinct MI300X | RunPod · Community · 192 GB VRAM | $0.50 |
| NVIDIA H200 NVL | RunPod · Community · 141 GB VRAM | $0.50 |
| NVIDIA A100 80GB PCIe | Vast.ai · Spot · 80 GB VRAM | $0.62 |
| NVIDIA A100 80GB PCIe | Vast.ai · On-Demand · 80 GB VRAM | $0.62 |
| NVIDIA L40 | RunPod · Community · 48 GB VRAM | $0.69 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
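To compare a rental rate against per-token API pricing, the hourly price has to be divided by throughput. The sketch below shows that conversion. The function name is hypothetical, and pairing the $0.62 A100 80GB PCIe rate with the 51.2 tok/s figure listed for the A100 SXM4 is an assumption, since the two variants may not perform identically.

```python
def rental_cost_per_million(usd_per_gpu_hour: float, tokens_per_s: float) -> float:
    """USD per 1M generated tokens on a rented GPU running one stream."""
    tokens_per_hour = tokens_per_s * 3600
    return usd_per_gpu_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Assumed pairing: PCIe rental price with the SXM4 throughput figure.
    print(f"${rental_cost_per_million(0.62, 51.2):.2f} per 1M tokens")
```

This treats the GPU as serving a single request at a time. Batching several requests would lower the per-token cost, and idle time would raise it.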
Gemma 4 12B is Google DeepMind’s first unified, encoder-free multimodal model for local deployment. At 12 billion parameters, it sits between the edge‑optimized E4B and the 26B MoE variant, offering a dense architecture that fits into a 16GB GPU while delivering benchmark performance near the larger model. Trained on data through January 2025 and released under Apache 2.0, it is designed for developers who need both vision and language capabilities on consumer hardware — no cloud API calls required.
The model accepts text, images, audio, and video input, and supports a 256,000‑token context window. It scores 77.2% on MMLU-Pro, 78.8% on GPQA Diamond, 77.5% on AIME 2026 (without tools), and 69.1% on MMMU Pro — numbers that put it ahead of most 7B–13B dense models and competitive with much larger MoE alternatives. For practitioners running local AI models with 12B parameters in 2026, Gemma 4 12B is a strong candidate for workloads that require both reasoning and multimodal understanding.
Gemma 4 12B uses a dense transformer backbone — no mixture‑of‑experts, no separate vision or audio encoders. Raw image patches and audio waveforms are projected directly into the language model’s embedding space, a design that reduces multimodal latency by bypassing heavy multi‑stage encoder pipelines. This encoder‑free approach also simplifies deployment: you load one model, not two, and the memory footprint stays predictable.
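As a rough illustration of what "projecting raw patches" means in general, the sketch below cuts a small grayscale image into square patches and maps each one to an embedding vector with a single linear layer. It is not Gemma's implementation: the patch size, the grayscale input, the weight values, and both function names are hypothetical, chosen only to keep the example readable.

```python
def patchify(image, patch):
    """Split a 2D image (list of rows) into flattened patch vectors, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    return [
        [image[top + r][left + c] for r in range(patch) for c in range(patch)]
        for top in range(0, height, patch)
        for left in range(0, width, patch)
    ]


def project(patches, weight):
    """Linear projection; weight has one row per output dimension."""
    return [[sum(x * w for x, w in zip(vec, row)) for row in weight] for vec in patches]


if __name__ == "__main__":
    image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
    weight = [[1, 0, 0, 0], [0, 0, 0, 1]]  # toy weights: 4 inputs -> 2 outputs
    for embedding in project(patchify(image, 2), weight):
        print(embedding)
```

In an encoder-free design, vectors like these would enter the transformer alongside text token embeddings, rather than first passing through a separate vision model.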
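Key specs: 12B parameters (dense); 256K-token context window; text, image, audio, and video input; more than 140 languages; training data through January 2025; Apache 2.0 license.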
Because it is dense, every forward pass uses all 12B parameters. This means VRAM usage is proportional to the full model size, unlike MoE models where only a fraction of parameters are active per token. For a given quantization level, you can expect consistent memory consumption and throughput — no variance from routing decisions.
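For a dense model, the memory taken by the weights alone is roughly the parameter count times the bits stored per weight. The sketch below applies that rule. The bits-per-weight value for Q4_K_M is an approximation assumed for illustration, and the function name is hypothetical. The KV cache and runtime overhead come on top of these figures.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# FP16 is 16 bits by definition; Q8_0 stores 8.5 bits per weight
# (8-bit values plus a per-block scale); Q4_K_M is an assumed average.
APPROX_BITS = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

if __name__ == "__main__":
    for name, bits in APPROX_BITS.items():
        print(f"{name}: ~{weight_memory_gb(12e9, bits):.1f} GB")
```

For 12B parameters this gives about 24 GB at FP16, under 13 GB at Q8_0, and a little over 7 GB at Q4_K_M, before any context is loaded.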
The 256K context enables long‑form reasoning, code analysis over repositories, and processing of lengthy audio transcripts or video segments without chunking. Multilingual support covers over 140 languages, and the model natively accepts the system role for structured conversations.
Gemma 4 12B was trained for chat, code, vision, reasoning, function‑calling, multilingual, math, and instruction‑following. It is not a specialist model — it’s a generalist that performs well across these axes without needing separate adapters.
With 77.5% on AIME 2026 (no tools) and 78.8% on GPQA Diamond, the model handles multi‑step reasoning and mathematical proofs. This is useful for agentic workflows where the model must break tasks into sub‑goals, call functions, and verify intermediate results.
Benchmarks indicate strong coding capability comparable to the 26B MoE variant. Expect solid performance on common languages (Python, JavaScript, C++, Go) and reasonable support for niche ones. The 256K context allows feeding entire codebases or diff histories for refactoring and review.
Because there are no external encoders, the model processes images and audio natively. This means lower latency for multimodal tasks like diagram analysis, OCR on screenshots, or transcribing and reasoning over meeting recordings. Video input is supported as a sequence of frames (no dedicated video encoder) — practical for short clips or keyframe analysis.
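Since video is handled as a sequence of frames, the caller usually decides which frames to send. One simple approach, sketched here as an assumption rather than anything the model requires, is to take one frame every few seconds and thin the result evenly if it exceeds a frame budget. The function and parameter names are hypothetical.

```python
def sample_frame_indices(total_frames: int, fps: float, every_s: float, max_frames: int) -> list[int]:
    """Pick one frame every `every_s` seconds, capped at `max_frames`."""
    step = max(1, round(fps * every_s))
    indices = list(range(0, total_frames, step))
    if len(indices) > max_frames:
        # Thin evenly so the kept frames still span the whole clip.
        stride = len(indices) / max_frames
        indices = [indices[int(i * stride)] for i in range(max_frames)]
    return indices


if __name__ == "__main__":
    # A 10-second clip at 30 fps, one frame per second, at most 5 frames.
    print(sample_frame_indices(300, 30.0, 1.0, 5))
```

The selected indices can then be decoded with any video library and passed to the model as ordinary images.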
Native function‑calling support enables autonomous workflows. Developers can define tools and let the model decide when to invoke them — useful for local AI agents that query databases, run shell commands, or interact with APIs without round‑tripping to a cloud endpoint.
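On the application side, tool use comes down to parsing the call the model emits, running the matching function, and returning the result. The sketch below shows that dispatch step only. The JSON shape with `name` and `arguments` keys is an assumed format for illustration, not the model's documented output, and `get_time` is a hypothetical tool.

```python
import json


def get_time(city: str) -> str:
    """Hypothetical tool; a real one would consult a clock or an API."""
    return f"12:00 in {city}"


# Registry of functions the model is allowed to call.
TOOLS = {"get_time": get_time}


def dispatch(tool_call_json: str) -> str:
    """Run the requested tool and return a JSON result for the model."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**call.get("arguments", {}))
    return json.dumps({"name": name, "result": result})


if __name__ == "__main__":
    print(dispatch('{"name": "get_time", "arguments": {"city": "Oslo"}}'))
```

Keeping an explicit registry means the model can only trigger functions that were deliberately exposed, which matters when tools run shell commands or touch a database.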
The model was trained on data in over 140 languages and performs strongly on European and East Asian languages. Instruction‑following quality remains consistent across languages, making it suitable for deploying a single model globally.
The model is designed for a 16GB GPU, but real‑world VRAM requirements depend on quantization and context length.
| Quantization | VRAM (approx.) | Typical hardware | Expected tokens/s (16GB GPU) |
|---|---|---|---|
| Q4_K_M (recommended) | ~8 GB | RTX 3090, RTX 4070 Ti, M4 Pro (24GB unified) | 20–40 tok/s |
| Q5_K_M | ~10 GB | RTX 4090, A6000, M4 Max (48GB) | 15–25 tok/s |
| Q8_0 | ~13 GB | RTX 4090, dual GPU setups | 10–15 tok/s |
| FP16 (full) | ~24 GB | A100, dual RTX 4090 | 5–10 tok/s |
Minimum hardware: A GPU with 8 GB VRAM will run Q4_K_M at small contexts (<8K tokens). For the full 256K context, you need at least 16 GB VRAM, because the KV cache grows linearly with context length. The model was tested on a single RTX 4090 (24 GB), an M4 Max (48 GB unified), and 16 GB systems (RTX 4060 Ti 16 GB).
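The linear growth of the KV cache can be made concrete with the standard full-attention estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The layer count, head count, and head dimension below are placeholders, not published Gemma 4 values, so the absolute numbers are illustrative only. Techniques such as sliding-window attention or cache quantization can lower real usage considerably.

```python
def kv_cache_gb(tokens: int, n_layers: int, n_kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Naive full-attention KV cache size in GB (10^9 bytes)."""
    # Factor 2 covers keys and values; bytes_per_value=2 assumes FP16.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * tokens / 1e9


if __name__ == "__main__":
    # Placeholder architecture values, chosen only for illustration.
    for tokens in (8_192, 16_384, 32_768):
        print(tokens, f"{kv_cache_gb(tokens, 48, 8, 128):.2f} GB")
```

Doubling the context doubles the cache, which is why a quantization level that fits comfortably at 8K tokens can run out of memory at much longer contexts.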
Recommended setup for most users:
- Q4_K_M quantization (best quality per GB)
- Ollama (ollama run gemma4) or Hugging Face Transformers with load_in_4bit=True

Performance notes:
For detailed benchmarking, see the Ollama model page (9.6 GB for the gemma4:latest tag) or Google’s developer guide.
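Once the model has been pulled, it can also be called from code through Ollama's local HTTP API. The sketch below uses only the standard library and assumes an Ollama server on its default port with the `gemma4:12b` tag from the run command at the top of this page. The helper names are made up for the example.

```python
import json
import urllib.request

# Ollama's default local endpoint for single-prompt generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gemma4:12b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4:12b", timeout: float = 300.0) -> str:
    """Send the prompt and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `generate("Explain dense versus MoE models in two sentences.")` should return the model's reply as a string. The generous timeout allows for slow first-token latency while the weights load.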
In short, Gemma 4 12B occupies a unique spot: dense, local‑friendly, multimodal, and open. It is not the fastest at pure text generation, but for developers who need to run models that see, hear, and reason on a single consumer GPU, it is currently one of the strongest options available.
