A community fine-tune of Google's open-weight gemma-4-12B-it, built by yuxinlu1 and shipped as GGUF for local inference. It was trained on execution-verified Python reasoning traces: real chains of thought from Composer 2.5 (only solutions that passed their tests were kept) plus a Fable 5 "second attempt" set covering the harder problems Composer 2.5 missed, both verified by running the code. The model reasons in the open before emitting a solution and is focused on Python and algorithmic coding. It keeps the base model's 256K context window and Apache 2.0 license.
A solid 12B-parameter dense language model from yuxinlu1. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
No benchmark data available for this model yet.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 29.5 GB | Low |
| Q4_K_M (Recommended) | 32.0 GB | Good |
| Q5_K_M | 33.2 GB | Very Good |
| Q6_K | 34.7 GB | Excellent |
| Q8_0 | 37.7 GB | Near Perfect |
| FP16 | 49.1 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Rating | Speed | VRAM |
|---|---|---|---|---|
| NVIDIA A100 SXM4 80GB | NVIDIA | SS | 51.2 tok/s | 32.0 GB |
| NVIDIA H100 SXM5 80GB | NVIDIA | SS | 84.2 tok/s | 32.0 GB |
| Google Cloud TPU v5p | Google | SS | 69.5 tok/s | 32.0 GB |
| | | SS | 61.6 tok/s | 32.0 GB |
| | | SS | 93.0 tok/s | 32.0 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | SS | 120.6 tok/s | 32.0 GB |
| | | SS | 133.2 tok/s | 32.0 GB |
| Google TPU v7 (Ironwood) | Google | SS | 185.4 tok/s | 32.0 GB |
| NVIDIA B200 GPU | NVIDIA | SS | 201.0 tok/s | 32.0 GB |
| | | SS | 150.8 tok/s | 32.0 GB |
| | | AA | 201.0 tok/s | 32.0 GB |
| | | AA | 178.4 tok/s | 32.0 GB |
| | | AA | 178.4 tok/s | 32.0 GB |
| | | AA | 178.4 tok/s | 32.0 GB |
| | | AA | 178.4 tok/s | 32.0 GB |
| SuperMicro Super AI Station | SuperMicro | AA | 178.4 tok/s | 32.0 GB |
| Gigabyte W775-V10-L01 | Gigabyte | AA | 178.4 tok/s | 32.0 GB |
| | | AA | 24.1 tok/s | 32.0 GB |
| Origin PC L-CLASS v2 | Origin PC | AA | 24.1 tok/s | 32.0 GB |
| NVIDIA L40S | NVIDIA | AA | 21.7 tok/s | 32.0 GB |
| | | AA | 20.1 tok/s | 32.0 GB |
| | | BB | 20.1 tok/s | 32.0 GB |
| | | BB | 10.1 tok/s | 32.0 GB |
| | | BB | 20.6 tok/s | 32.0 GB |
| | | BB | 15.4 tok/s | 32.0 GB |
Energy cost on Apple M4 (~3.0 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): Gemma 4 12B Coder on Apple M4 · ~3.0 tok/s · 25W | $0.276 |
| GPT-5.5 (OpenAI) · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic) · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.5 Flash (Google) · in $1.50 · out $9.00 | $3.75 |
| Grok 4.3 (xAI) · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included.
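The arithmetic behind the table can be sketched in a few lines of Python. The 70/30 blend is stated above; the electricity price is not, so the `usd_per_kwh` value used in the usage note is an assumption chosen because it lands close to the table's local figure. The function names are illustrative.

```python
def blended_price(input_usd: float, output_usd: float, input_share: float = 0.7) -> float:
    """Blend per-1M-token input and output prices (70% input / 30% output by default)."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


def energy_cost_per_million(tokens_per_s: float, watts: float, usd_per_kwh: float) -> float:
    """Electricity cost of generating 1M tokens locally, ignoring hardware cost."""
    seconds = 1_000_000 / tokens_per_s
    kwh = watts * seconds / 3_600_000  # watt-seconds -> kilowatt-hours
    return kwh * usd_per_kwh
```

For example, `blended_price(5.00, 30.00)` gives the $12.50 shown for GPT-5.5, and `energy_cost_per_million(3.0, 25, 0.12)` comes out at about $0.28 under an assumed rate of $0.12/kWh, in line with the local row.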
Cheapest current cloud rentals with at least 32 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| AMD Instinct MI300X · RunPod · Community · 192 GB VRAM | $0.50 |
| NVIDIA H200 NVL · RunPod · Community · 141 GB VRAM | $0.50 |
| NVIDIA A100 80GB PCIe · Vast.ai · Spot · 80 GB VRAM | $0.62 |
| NVIDIA A100 80GB PCIe · Vast.ai · On-Demand · 80 GB VRAM | $0.62 |
| NVIDIA L40 · RunPod · Community · 48 GB VRAM | $0.69 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
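To compare a rental against per-token pricing, the hourly rate has to be converted using a throughput figure. The sketch below assumes a single generation stream running at a steady rate and ignores idle time, startup, and restarts; the function name is illustrative, and the throughput passed in is whatever you measure on the rented GPU.

```python
def rental_cost_per_million(usd_per_gpu_hour: float, tokens_per_s: float) -> float:
    """Cost of generating 1M tokens on a rented GPU at a steady single-stream rate."""
    tokens_per_hour = tokens_per_s * 3600
    return usd_per_gpu_hour / tokens_per_hour * 1_000_000
```

As a rough illustration, `rental_cost_per_million(0.62, 51.2)` works out to about $3.36 per 1M tokens. Batching several requests on the same GPU would lower the effective cost.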
Gemma 4 12B Coder is a community fine-tune of Google’s open-weight gemma-4-12B-it, built by developer yuxinlu1 and distributed as GGUF for local inference. This is a dense 12B-parameter text-only model focused on Python and algorithmic coding, with a training pipeline that prioritizes verifiable correctness over volume.
What distinguishes this model from the base Gemma 4 12B IT is the training data strategy. The fine-tune uses execution-verified reasoning traces: real chains of thought from Composer 2.5 where only solutions that passed their test suites were retained, plus a Fable 5 “second attempt” set covering the harder problems Composer 2.5 missed—again verified by running the code. The model reasons in the open before emitting a solution, making its decision process inspectable.
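The idea of execution-verified filtering can be sketched as follows. This is a minimal illustration, not the author's pipeline: the trace fields (`reasoning`, `solution`), the function names, and the test-case format are all hypothetical, and a real pipeline would run candidates in a sandbox with timeouts rather than calling `exec` directly.

```python
from typing import Any


def passes_tests(source: str, func_name: str, cases: list[tuple[tuple, Any]]) -> bool:
    """Return True only if the candidate defines func_name and matches every expected output."""
    namespace: dict[str, Any] = {}
    try:
        exec(source, namespace)  # unsafe outside a sandbox; for illustration only
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


def filter_verified(traces: list[dict], func_name: str, cases: list[tuple[tuple, Any]]) -> list[dict]:
    """Keep only the traces whose final solution passes the whole test suite."""
    return [t for t in traces if passes_tests(t["solution"], func_name, cases)]
```

Only traces whose code actually runs and matches the expected outputs survive, so a plausible-looking chain of thought that ends in a wrong answer is dropped.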
This model occupies the niche of small, local-first coding assistants that can run on consumer hardware. It competes with smaller open coding models such as CodeQwen 1.5 7B and DeepSeek-Coder 6.7B, but brings the Gemma 4 architecture’s 256K context window and Apache 2.0 licensing. It is not a general-purpose assistant; it is purpose-built for Python coding tasks where reasoning transparency and solution correctness matter.
Gemma 4 12B Coder uses a dense transformer architecture with 12 billion parameters. Unlike mixture-of-experts (MoE) models that activate only a subset of parameters per token, this is a dense model: all 12B parameters are active for every forward pass. This means the weight footprint and per-token compute are the same for every token, with no routing overhead that can introduce latency variance.
The tradeoff is straightforward: dense models at this size require more VRAM than MoE models with comparable active parameter counts, but they deliver more predictable inference behavior and simpler deployment. For a 12B dense model, the weights alone take approximately 6-7 GB for a Q4 quant and 12-13 GB for Q8, rising to roughly 24 GB at FP16 (12 billion parameters at 2 bytes each). The KV cache adds to this as the context grows.
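A back-of-the-envelope estimate of weight memory is parameter count times bits per weight. The sketch below uses approximate bits-per-weight values for common GGUF formats; these are assumptions for illustration (K-quants mix precisions, so real file sizes differ somewhat), and the result excludes the KV cache and runtime buffers.

```python
# Approximate effective bits per weight; assumed values, not measured from this model's files.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weight_memory_gb(n_params: float, fmt: str) -> float:
    """Estimate weight-only memory in GB (1 GB = 1e9 bytes) for a dense model."""
    bits = APPROX_BITS_PER_WEIGHT[fmt]
    return n_params * bits / 8 / 1e9


if __name__ == "__main__":
    for fmt in APPROX_BITS_PER_WEIGHT:
        print(f"{fmt}: ~{weight_memory_gb(12e9, fmt):.1f} GB")
```

Because the model is dense, this estimate applies to every token; there is no smaller "active" subset to count separately.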
The context length is 256K tokens (262,144 positions), which is the full Gemma 4 specification. An earlier metadata bug in the upstream Gemma 4 release caused some quants to report only 131K context; this has been corrected in all current GGUF releases. The 256K window enables processing entire codebases, long documentation, or multi-file reasoning tasks without chunking.
The model is text-only. No multimodal capabilities, no vision, no audio. This is a deliberate constraint—every parameter is allocated to text reasoning and code generation.
Gemma 4 12B Coder is built for Python coding with an emphasis on algorithmic problem-solving. The training data consists entirely of function-level coding tasks with deterministic test suites, so the model is strongest at:
The model is not optimized for web development, API integration, or general-purpose chat. It will attempt those tasks, but its training distribution is heavily weighted toward algorithmic Python. If you need a model for Django, Flask, or front-end work, look elsewhere.
Concrete use cases:
This is where the model delivers its primary value: running on consumer hardware without cloud dependencies.
| Quantization | Approx. VRAM (weights only) | Quality Tradeoff |
|---|---|---|
| Q4_K_M | ~7 GB | Good for most tasks, recommended default |
| Q5_K_M | ~8.5 GB | Better quality, minor VRAM increase |
| Q8_0 | ~13 GB | Near-lossless, requires more VRAM |
| FP16 (safetensors) | ~24 GB | Full precision, for fine-tuning or maximum quality |
On an RTX 4090 with Q4_K_M and 2048-token context, expect 40-60 tokens/second. On an M4 Max with Q8_0, expect 30-50 tokens/second. On an RTX 4070 Ti with Q4_K_M, expect 25-40 tokens/second. These numbers vary with prompt length, batch size, and backend configuration.
The fastest way to run Gemma 4 12B Coder locally is through Ollama. The GGUF quants are available on Hugging Face and can be imported directly:
ollama run hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF:Q4_K_M
For llama.cpp directly, download the GGUF file and run:
./llama-cli -m gemma-4-12B-coder-Q4_K_M.gguf -p "Your prompt here" -n 1024
The model also works with LM Studio, Jan, and any backend that supports GGUF loading.
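Once the model has been pulled with the Ollama command above, it can also be called from a script through Ollama's local HTTP API. The sketch below assumes a default Ollama install listening on `localhost:11434`; the helper names and the `num_ctx` default are illustrative choices, not recommendations from the model author.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local Ollama endpoint
MODEL_TAG = "hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF:Q4_K_M"


def build_request(prompt: str, num_ctx: int = 8192) -> urllib.request.Request:
    """Build a non-streaming generate request for the locally served model."""
    payload = {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a token stream
        "options": {"num_ctx": num_ctx},  # context window to allocate (assumed value)
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, num_ctx: int = 8192, timeout: float = 600.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, num_ctx), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a Python function that merges overlapping intervals.")` returns the full reply, reasoning included, as a single string. A larger `num_ctx` raises memory use, so it is worth starting small and increasing it only when the task needs it.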
vs. CodeQwen 1.5 7B (7B parameters): CodeQwen is smaller and faster, with broader language support (Python, JavaScript, Java, C++, and many others). Gemma 4 12B Coder has roughly 70% more parameters and a 256K context window versus CodeQwen’s 64K. Choose Gemma 4 12B Coder for complex algorithmic Python tasks where reasoning transparency and long context matter. Choose CodeQwen for faster inference on simpler coding tasks or when you need multi-language support.
vs. DeepSeek-Coder 6.7B (6.7B parameters): DeepSeek-Coder is trained on a massive corpus of code across languages and has strong fill-in-the-middle capabilities. Gemma 4 12B Coder is more specialized—its training data is narrower but higher quality (execution-verified). The Gemma 4 model also has Apache 2.0 licensing versus DeepSeek’s custom license. Choose Gemma 4 12B Coder for permissive licensing and verified reasoning traces. Choose DeepSeek-Coder for broader language coverage and fill-in-the-middle workflows.
The tradeoff is specialization versus breadth. Gemma 4 12B Coder does one thing—Python algorithmic coding with transparent reasoning—and does it with verified training data. If that matches your workload, it is a strong choice at the 12B parameter class.
