yuxinlu1

Gemma 4 12B Coder

A community fine-tune of Google's open-weight gemma-4-12B-it, built by yuxinlu1 and shipped as GGUF for local inference. It was trained on execution-verified Python reasoning traces: real chains of thought from Composer 2.5 (only solutions that passed their tests were kept) plus a Fable 5 "second attempt" set covering the harder problems Composer 2.5 missed, both verified by running the code. The model reasons in the open before emitting a solution and is focused on Python and algorithmic coding. It keeps the base model's 256K context window and Apache 2.0 license.

12B params · Dense · 256K ctx

View on Hugging Face Official Page

Our Take

Best for: Workstation-class chat and code with usable context

A solid 12B-parameter dense language model from yuxinlu1. A pragmatic middle-ground choice when you need open weights without a flagship-sized footprint.

Run this on: Corsair AI Workstation 300 (Ryzen AI Max 385). Cheapest card in our directory with comfortable headroom (48 GB) for this model at Q4 (~32.0 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Code Generation

Reasoning

Model Specifications

Parameters: 12B

Architecture: Dense

Context Length: 256K tokens

Modality: Text Only

Provider: yuxinlu1

Download Size: 75.4 GB

Community

Monthly Downloads: 495.8K

Likes: 2.3K

Last Updated: 7 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

No benchmark data available for this model yet.

MBA Open Score

68.7 (B)

Component	Weight	Score
Benchmark	40%	82.0
Popularity	25%	50.7
Efficiency	20%	74.6
Versatility	15%	55.0
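The overall score appears to be a weighted average of the four components listed above. The sketch below is an assumption about how the figure is derived, not the site's published formula. With the listed inputs it gives roughly 68.6, close to the displayed 68.7; the small gap may come from rounding of the component scores.

```python
# Component scores and weights as listed in the score breakdown.
# Treating the overall score as a weighted average is an assumption.
COMPONENTS = {
    "benchmark": (82.0, 0.40),
    "popularity": (50.7, 0.25),
    "efficiency": (74.6, 0.20),
    "versatility": (55.0, 0.15),
}


def weighted_score(components):
    """Return the weight-normalised average of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in components.values())
    weighted_sum = sum(score * weight for score, weight in components.values())
    return weighted_sum / total_weight


score = weighted_score(COMPONENTS)
print(f"Weighted score: {score:.3f}")
```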

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality	Notes
Q2_K	29.5 GB	Low	Aggressive quantization: smallest size, noticeable quality loss
Q4_K_M (Recommended)	32.0 GB	Good	Best balance of size and quality for most use cases
Q5_K_M	33.2 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	34.7 GB	Excellent	Near-lossless quality with manageable size
Q8_0	37.7 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	49.1 GB	Full	Full 16-bit floating point: maximum quality, largest size
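One way to use these figures is to pick the highest-quality format that fits a given memory budget. The helper below is a minimal sketch: the function name `best_quant` is hypothetical, and the VRAM values are copied from the table above.

```python
# (format, VRAM in GB) pairs copied from the table above, smallest first.
QUANT_VRAM_GB = [
    ("Q2_K", 29.5),
    ("Q4_K_M", 32.0),
    ("Q5_K_M", 33.2),
    ("Q6_K", 34.7),
    ("Q8_0", 37.7),
    ("FP16", 49.1),
]


def best_quant(budget_gb, table=QUANT_VRAM_GB):
    """Return the largest listed format whose VRAM figure fits the budget, or None."""
    fitting = [name for name, vram_gb in table if vram_gb <= budget_gb]
    return fitting[-1] if fitting else None


for budget in (24.0, 32.0, 48.0, 80.0):
    print(budget, best_quant(budget))
```

A budget below the Q2_K figure returns `None`, which signals that none of the listed formats fit at the memory level this table assumes.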

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Speed	VRAM Required
NVIDIA A100 SXM4 80GB	NVIDIA	S	51.2 tok/s	32.0 GB
NVIDIA H100 SXM5 80GB	NVIDIA	S	84.2 tok/s	32.0 GB
Google Cloud TPU v5p	Google	S	69.5 tok/s	32.0 GB
Intel Gaudi 2 AI Accelerator	Intel	S	61.6 tok/s	32.0 GB
Intel Gaudi 3 AI Accelerator	Intel	S	93.0 tok/s	32.0 GB
NVIDIA H200 SXM 141GB	NVIDIA	S	120.6 tok/s	32.0 GB
AMD Instinct MI300X	AMD	S	133.2 tok/s	32.0 GB
Google TPU v7 (Ironwood)	Google	S	185.4 tok/s	32.0 GB
NVIDIA B200 GPU	NVIDIA	S	201.0 tok/s	32.0 GB
AMD Instinct MI325X	AMD	S	150.8 tok/s	32.0 GB
AMD Instinct MI355X	AMD	A	201.0 tok/s	32.0 GB
ASUS ExpertCenter Pro ET900N G3	ASUS	A	178.4 tok/s	32.0 GB
Dell Pro Max with GB300	Dell	A	178.4 tok/s	32.0 GB
HP ZGX Fury AI Station	HP	A	178.4 tok/s	32.0 GB
MSI XpertStation WS300	MSI	A	178.4 tok/s	32.0 GB
SuperMicro Super AI Station	SuperMicro	A	178.4 tok/s	32.0 GB
Gigabyte W775-V10-L01	Gigabyte	A	178.4 tok/s	32.0 GB
NVIDIA RTX 6000 Ada Generation	NVIDIA	A	24.1 tok/s	32.0 GB
Origin PC L-CLASS v2	Origin PC	A	24.1 tok/s	32.0 GB
NVIDIA L40S	NVIDIA	A	21.7 tok/s	32.0 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	A	20.1 tok/s	32.0 GB
Apple Mac Studio (M2 Ultra, 2023)	Apple	B	20.1 tok/s	32.0 GB
Apple Mac Studio (M1 Max, 2022)	Apple	B	10.1 tok/s	32.0 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	B	20.6 tok/s	32.0 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	B	15.4 tok/s	32.0 GB


Run Locally vs API

Energy cost on Apple M4 (~3.0 tok/s, Q4_K_M) vs flagship API pricing.

Source	Details	Cost per 1M tokens
Local (energy only)	Gemma 4 12B Coder on Apple M4 · ~3.0 tok/s · 25 W	$0.276
GPT-5.5	OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 Thinking	Anthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 Flash	Google · in $1.50 · out $9.00	$3.75
Grok 4.3	xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.
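The arithmetic behind this comparison can be reproduced in a few lines. The sketch below blends API prices at the stated 70/30 split and converts the local figures (~3.0 tok/s at 25 W) into energy per million tokens. The page does not state an electricity rate, so the last line only backs out the rate implied by the $0.276 figure, which comes to roughly $0.12 per kWh. Function names are illustrative.

```python
def blended_price(input_usd, output_usd, input_share=0.70):
    """Blend per-1M-token input and output prices at the given input share."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


def energy_kwh_per_million(tokens_per_second, watts):
    """Energy in kWh needed to generate one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    return watts * seconds / 3_600_000  # watt-seconds to kWh


# API prices (input, output) in USD per 1M tokens, taken from the table.
gpt = blended_price(5.00, 30.00)
grok = blended_price(1.25, 2.50)

# Local run: ~3.0 tok/s at 25 W, as stated for the Apple M4.
kwh = energy_kwh_per_million(3.0, 25)

# Electricity rate implied by the listed $0.276 (an inference, not a stated input).
implied_rate = 0.276 / kwh

print(f"GPT-5.5 blended: ${gpt:.2f}")
print(f"Grok 4.3 blended: ${grok:.3f}")
print(f"Local energy: {kwh:.2f} kWh per 1M tokens")
print(f"Implied rate: ${implied_rate:.3f} per kWh")
```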


Rent in the Cloud

Cheapest current cloud rentals with at least 32 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
AMD Instinct MI300X	RunPod	Community	192 GB	$0.50
NVIDIA H200 NVL	RunPod	Community	141 GB	$0.50
NVIDIA A100 80GB PCIe	Vast.ai	Spot	80 GB	$0.62
NVIDIA A100 80GB PCIe	Vast.ai	On-Demand	80 GB	$0.62
NVIDIA L40	RunPod	Community	48 GB	$0.69

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
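To compare a rental against per-token API pricing, an hourly rate can be converted into a cost per million generated tokens. The sketch below assumes a single stream running at the throughput listed in the hardware compatibility table (133.2 tok/s for the AMD Instinct MI300X) with the GPU busy the whole hour. Real utilisation is usually lower, so treat the result as a lower bound.

```python
def rental_cost_per_million(usd_per_hour, tokens_per_second):
    """Cost in USD to generate one million tokens on a rented GPU at full utilisation."""
    tokens_per_hour = tokens_per_second * 3600
    return usd_per_hour * 1_000_000 / tokens_per_hour


# MI300X at $0.50 per GPU-hour and 133.2 tok/s, both figures from this page.
mi300x = rental_cost_per_million(0.50, 133.2)
print(f"MI300X: ${mi300x:.2f} per 1M tokens")
```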

See the full price index

About This Model

Overview

Gemma 4 12B Coder is a community fine-tune of Google’s open-weight gemma-4-12B-it, built by developer yuxinlu1 and distributed as GGUF for local inference. This is a dense 12B-parameter text-only model focused on Python and algorithmic coding, with a training pipeline that prioritizes verifiable correctness over volume.

What distinguishes this model from the base Gemma 4 12B IT is the training data strategy. The fine-tune uses execution-verified reasoning traces: real chains of thought from Composer 2.5 where only solutions that passed their test suites were retained, plus a Fable 5 “second attempt” set covering the harder problems Composer 2.5 missed—again verified by running the code. The model reasons in the open before emitting a solution, making its decision process inspectable.
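Execution verification of this kind can be illustrated with a small filter that runs each candidate solution against test cases and keeps only the ones that pass. This is a generic sketch, not the author's actual pipeline: the function name `passes_tests`, the toy `add` task, and the test cases are all hypothetical. A real pipeline would also run the code in a sandbox with time limits.

```python
def passes_tests(solution_src, func_name, cases):
    """Execute candidate source and check every (args, expected) case."""
    namespace = {}
    try:
        # Untrusted code: a real pipeline would sandbox this and enforce a timeout.
        exec(solution_src, namespace)
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


# Two hypothetical candidate solutions for a toy task; the second is wrong.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",
]
cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# Keep only the candidates whose code passes every test case.
kept = [src for src in candidates if passes_tests(src, "add", cases)]
print(len(kept))
```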

This model occupies the niche of small, local-first coding assistants that can run on consumer hardware. It competes with smaller open coding models such as CodeQwen 1.5 7B and DeepSeek-Coder 6.7B, and adds the Gemma 4 architecture’s 256K context window and Apache 2.0 licensing. It is not a general-purpose assistant: it is purpose-built for Python coding tasks where reasoning transparency and solution correctness matter.

Architecture & Technical Details

Gemma 4 12B Coder uses a dense transformer architecture with 12 billion parameters. Unlike mixture-of-experts (MoE) models that activate only a subset of parameters per token, this is a dense model: all 12B parameters are active for every forward pass. Weight memory is fixed and every token exercises the same parameters, so there is no routing overhead that can introduce latency variance.

The tradeoff is straightforward: a dense model spends more compute per token than an MoE model of the same total size, but it delivers more predictable inference behavior and simpler deployment. For a 12B dense model, the weights take approximately 7 GB at Q4_K_M, about 13 GB at Q8_0, and about 24 GB at FP16, before the KV cache is counted.
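A back-of-the-envelope check on weight memory is parameter count times bits per weight. The sketch below uses approximate average bits-per-weight values for common llama.cpp formats. These values are assumptions, since real GGUF files mix tensor types and vary by a few percent, and the estimate excludes the KV cache and runtime buffers.

```python
# Approximate average bits per weight for common llama.cpp formats.
# These are rough figures (assumptions); real files vary by tensor mix.
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def weight_gb(n_params, fmt):
    """Estimate weight storage in GB (10^9 bytes) for a dense model."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{weight_gb(12e9, fmt):.1f} GB")
```

For a 12B-parameter model, FP16 weights alone come to 24 GB, since each parameter takes two bytes.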

The context length is 256K tokens (262,144 positions), which is the full Gemma 4 specification. An earlier metadata bug in the upstream Gemma 4 release caused some quants to report only 131K context; this has been corrected in all current GGUF releases. The 256K window allows whole codebases, long documentation, or multi-file reasoning tasks to be processed without chunking, provided there is enough memory for the KV cache at that length.

The model is text-only. No multimodal capabilities, no vision, no audio. This is a deliberate constraint—every parameter is allocated to text reasoning and code generation.

Capabilities & Use Cases

Gemma 4 12B Coder is built for Python coding with an emphasis on algorithmic problem-solving. The training data consists entirely of function-level coding tasks with deterministic test suites, so the model is strongest at:

Algorithmic coding challenges: LeetCode-style problems, competitive programming tasks, data structure implementations
Python function generation: Writing self-contained functions with clear inputs and outputs
Code reasoning: Explaining approach, edge cases, and complexity before writing the solution
Debugging and fixing: The Fable 5 training set specifically targets problems where the teacher model failed, making the model more robust on hard cases

The model is not optimized for web development, API integration, or general-purpose chat. It will attempt those tasks, but its training distribution is heavily weighted toward algorithmic Python. If you need a model for Django, Flask, or front-end work, look elsewhere.

Concrete use cases:

Local coding assistant for Python development, especially algorithmic work
Code review companion that explains reasoning before suggesting fixes
Educational tool for learning Python algorithms and data structures
Offline problem-solving for competitive programming practice

Running Gemma 4 12B Coder Locally

This is where the model delivers its primary value: running on consumer hardware without cloud dependencies.

VRAM Requirements by Quantization

Quantization	Weights (approx.)	Quality Tradeoff
Q4_K_M	~7 GB	Good for most tasks, recommended default
Q5_K_M	~8.5 GB	Better quality, minor VRAM increase
Q8_0	~13 GB	Near-lossless, requires more VRAM
FP16 (safetensors)	~24 GB	Full precision, for fine-tuning or maximum quality

These figures cover the weights only. The KV cache adds memory on top and grows with context length.

Consumer Hardware That Works

NVIDIA RTX 4090 (24 GB): Runs Q8_0 with room for a moderate context. Q4_K_M leaves headroom for large prompts.
NVIDIA RTX 3090 (24 GB): Same as the 4090 for VRAM purposes. Expect slightly lower tokens/second.
NVIDIA RTX 4070 Ti (12 GB): Q4_K_M fits. Q5_K_M is tight but workable with a smaller context.
Apple M4 Max (64 GB unified): Runs Q8_0 or FP16 easily. Unified memory lets the GPU draw on most of system RAM instead of a separate VRAM pool.
Apple M3 Pro (18 GB unified): Q4_K_M works. Q5_K_M may be tight depending on context length.
Apple M2 (8 GB unified): Q4_K_M weights alone nearly fill memory, so it is unlikely to run well. A smaller quant such as Q3_K_M or Q2_K is the practical option, at a quality cost.

Expected Performance

On an RTX 4090 with Q4_K_M and 2048-token context, expect 40-60 tokens/second. On an M4 Max with Q8_0, expect 30-50 tokens/second. On an RTX 4070 Ti with Q4_K_M, expect 25-40 tokens/second. These numbers vary with prompt length, batch size, and backend configuration.

Quick Start with Ollama

The fastest way to run Gemma 4 12B Coder locally is through Ollama. The GGUF quants are available on Hugging Face and can be imported directly:

ollama run hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF:Q4_K_M

For llama.cpp directly, download the GGUF file and run:

./llama-cli -m gemma-4-12B-coder-Q4_K_M.gguf -p "Your prompt here" -n 1024

The model also works with LM Studio, Jan, and any backend that supports GGUF loading.
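Once the model is running under Ollama, it can also be called from a script through Ollama's local HTTP API. The sketch below uses only the Python standard library. It assumes the default Ollama address (`http://localhost:11434`) and the model tag from the command above; the helper names `build_request` and `generate` are illustrative.

```python
import json
import urllib.request

# Model tag as used in the ollama run command above.
MODEL = "hf.co/yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF:Q4_K_M"


def build_request(prompt, host="http://localhost:11434"):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": MODEL, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt):
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a Python function that reverses a linked list.")` needs a running Ollama server with the model already pulled. Because the model reasons before answering, the returned text is likely to contain that reasoning ahead of the final code.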

How It Compares

vs. CodeQwen 1.5 7B (7B parameters): CodeQwen is smaller and faster, with broader language support (Python, JavaScript, Java, C++). Gemma 4 12B Coder has about 70% more parameters and a 256K context window versus CodeQwen’s 64K. Choose Gemma 4 12B Coder for complex algorithmic Python tasks where reasoning transparency and long context matter. Choose CodeQwen for faster inference on simpler coding tasks or when you need multi-language support.

vs. DeepSeek-Coder 6.7B (6.7B parameters): DeepSeek-Coder is trained on a massive corpus of code across languages and has strong fill-in-the-middle capabilities. Gemma 4 12B Coder is more specialized—its training data is narrower but higher quality (execution-verified). The Gemma 4 model also has Apache 2.0 licensing versus DeepSeek’s custom license. Choose Gemma 4 12B Coder for permissive licensing and verified reasoning traces. Choose DeepSeek-Coder for broader language coverage and fill-in-the-middle workflows.

The tradeoff is specialization versus breadth. Gemma 4 12B Coder does one thing—Python algorithmic coding with transparent reasoning—and does it with verified training data. If that matches your workload, it is a strong choice at the 12B parameter class.

