
Google's most capable open dense model from the Gemma 4 family, built from Gemini 3 research. 31B parameters with 256K context, configurable thinking modes, and multimodal support (text, image, video). Ranks #3 among open models on the Arena AI text leaderboard with an estimated score of 1452. Fits on a single 80GB GPU in bfloat16.
Copy and paste this command to start running the model locally:

ollama run gemma4:31b

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 75.5 GB | Low | Aggressive quantization; smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 82.0 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 85.1 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | 88.8 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 96.5 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 126.0 GB | Full | Full 16-bit floating point; maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H200 SXM 141GB | SS | 47.1 tok/s | 82.0 GB |
| NVIDIA B200 GPU | SS | 78.6 tok/s | 82.0 GB |
| Google Cloud TPU v5p | AA | 27.2 tok/s | 82.0 GB |
| NVIDIA H100 SXM5 80GB | BB | 32.9 tok/s | 82.0 GB |
| NVIDIA A100 SXM4 80GB | CC | 20.0 tok/s | 82.0 GB |
Gemma 4 31B IT represents a significant shift in Google’s open-model strategy, bridging the gap between lightweight edge models and massive enterprise-grade LLMs. Built on the research foundations of Gemini 3, this is a dense 31B parameter model designed for high-reasoning tasks and complex multimodal workflows. Unlike previous iterations that focused heavily on text-only efficiency, the 31B IT variant is a native text-and-vision model, capable of processing images and video alongside text inputs.
Ranked #3 on the Arena AI text leaderboard with an estimated score of 1452, it competes directly with models twice its size. For local practitioners, the 31B parameter count represents a "sweet spot": it is small enough to fit on high-end consumer hardware with quantization, yet large enough to exhibit the sophisticated reasoning and instruction-following capabilities typically reserved for 70B+ models. It is engineered for developers who need a reliable local model that can handle function calling and long-context retrieval without the latency of a cloud API.
Gemma 4 31B IT uses a dense transformer architecture. While many competitors are moving toward Mixture-of-Experts (MoE) to reduce compute costs, Google has opted for a dense 31B parameter structure here to maximize per-parameter intelligence and reasoning stability. This results in a more predictable memory footprint and consistent performance across diverse prompts.
One of the standout technical specs is the 256,000 token context length. This massive context window allows for the ingestion of entire codebases, long technical manuals, or hour-long video files. To support this, the model utilizes advanced attention mechanisms that maintain high retrieval accuracy (Needle In A Haystack performance) even at the upper limits of the window.
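The memory cost of that window is worth quantifying. Google has not published the 31B model's full attention dimensions, so the layer count, KV-head count, and head size below are illustrative assumptions; the arithmetic itself (two cached tensors per layer per token) is general:

```python
def kv_cache_gb(tokens, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per=2):
    """Estimate KV-cache size: two tensors (K and V) per layer, each
    [n_kv_heads, head_dim] per token, at the given precision.
    The architecture numbers here are assumed, not published."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per * tokens / 1024**3

# A fully used 256K-token context is substantial even with grouped-query attention:
print(f"{kv_cache_gb(256_000):.1f} GB")  # roughly 47 GB with these assumed dims
```

Even under grouped-query attention, a full 256K window adds tens of gigabytes on top of the weights, which helps explain why total VRAM figures run well above the weights-only numbers.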
The model is natively multimodal. The vision encoder is integrated directly into the architecture, allowing it to "see" and reason about visual data without needing a separate adapter. It also features configurable thinking modes, which allow the model to allocate more compute to internal reasoning before generating a final response—a crucial feature for complex math and logic-heavy tasks.
Gemma 4 31B IT is a generalist with specialized performance in high-logic domains. Its instruction-following is precise, making it ideal for structured data extraction and complex function-calling where smaller models often fail to adhere to JSON schemas.
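A minimal sketch of what "adhering to a JSON schema" means in practice once the model is wired into a tool loop. The tool definition and field names here are hypothetical; the point is to validate the model's raw output before executing anything:

```python
import json

# Hypothetical tool schema: the argument names and types a function call must follow.
WEATHER_TOOL = {"name": "get_weather", "args": {"city": str, "unit": str}}

def parse_tool_call(raw: str, tool: dict) -> dict:
    """Parse a model's JSON function call and check it against the schema.
    Raises ValueError on a wrong tool name or bad/missing arguments."""
    call = json.loads(raw)
    if call.get("name") != tool["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')}")
    args = call.get("arguments", {})
    for key, typ in tool["args"].items():
        if not isinstance(args.get(key), typ):
            raise ValueError(f"bad or missing argument: {key}")
    return args

# A well-formed call passes; malformed output fails loudly instead of silently.
ok = parse_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}',
    WEATHER_TOOL,
)
print(ok["city"])  # Oslo
```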
For coding, Gemma 4 31B IT excels at multi-file refactoring and bug hunting. Because it can hold 256K tokens in memory, you can feed it multiple source files so it can understand architectural dependencies. It supports dozens of programming languages, with particular strength in Python, Rust, and C++. Practitioners use it for local code completion and for generating unit tests where data privacy rules out hosted solutions.
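A common pattern for exploiting the window is to pack several files into one prompt. This sketch shows the idea; the path-header format and the character budget (roughly sized for a 256K-token window at ~4 characters per token) are arbitrary choices, not a prescribed format:

```python
from pathlib import Path

def pack_repo(paths, budget_chars=800_000):
    """Concatenate source files into one prompt, each prefixed with its
    path, stopping before a rough character budget is exceeded."""
    parts, used = [], 0
    for p in paths:
        text = Path(p).read_text(encoding="utf-8", errors="replace")
        chunk = f"### FILE: {p}\n{text}\n"
        if used + len(chunk) > budget_chars:
            break  # budget reached; remaining files are left out
        parts.append(chunk)
        used += len(chunk)
    return "".join(parts)
```

The resulting string can be prepended to a refactoring or bug-hunt instruction and sent as a single prompt.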
The vision capabilities extend beyond simple OCR. It can analyze architectural diagrams, interpret complex data visualizations, and perform temporal reasoning on video files. This makes it a powerful tool for automated video summarization or visual debugging of UI/UX layouts.
Its reasoning benchmark scores are bolstered by its ability to handle chain-of-thought processing in multiple languages. It is particularly effective at solving graduate-level math problems and symbolic logic puzzles, making it a viable candidate for local RAG (Retrieval-Augmented Generation) systems that require high synthesis capability.
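To make the RAG idea concrete, here is a deliberately simple retriever that ranks chunks by word overlap with the query. A real pipeline would use embeddings, but the shape of the loop (score, rank, stuff the winners into the prompt) is the same:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query -- a stand-in for
    the embedding search a real local RAG pipeline would use."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

docs = [
    "Gemma 4 31B IT supports a 256K context window.",
    "The vision encoder is integrated into the architecture.",
    "Q4_K_M quantization fits on a 24GB GPU.",
]
top = retrieve("what context window does the model support", docs, k=1)
print(top[0])  # the context-window document scores highest
```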
To run Gemma 4 31B IT locally, your primary constraint will be VRAM. Because this is a dense 31B model, it requires significantly more memory than the Gemma 2 9B or Llama 3 8B, but it is much more accessible than 70B+ models.
The model in its native bfloat16 precision requires approximately 62GB of VRAM just for the weights, plus additional overhead for the KV cache. To run this comfortably, you generally have two paths: professional GPUs or aggressive quantization on consumer hardware.
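The weights-only arithmetic is straightforward: parameter count times bits per weight. The bits-per-weight values below are approximate averages for llama.cpp-style quants, so treat the outputs as estimates:

```python
PARAMS = 31e9  # 31B dense parameters

def weights_gb(bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal GB
    (excludes KV cache and activation overhead)."""
    return PARAMS * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common formats (assumed averages):
for name, bits in [("BF16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 2.8)]:
    print(f"{name:8s} {weights_gb(bits):5.1f} GB")
```

This reproduces the ballpark figures in the table below: ~62 GB at BF16 down to ~11 GB at Q2_K.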
Choosing the best quantization for Gemma 4 31B IT involves balancing perplexity (intelligence) against speed.
| Quantization | VRAM Required (Weights Only) | Recommended Hardware |
| :--- | :--- | :--- |
| FP16 / BF16 | ~62 GB | A100 80GB, Mac 96GB+ |
| Q8_0 | ~33 GB | RTX 6000 Ada, Mac 64GB |
| Q4_K_M | ~19 GB | RTX 3090/4090 24GB |
| Q2_K | ~11 GB | RTX 3060 12GB / 4070 Ti Super 16GB |
For most practitioners, Q4_K_M is the "goldilocks" quantization. It retains nearly all the model's intelligence while allowing it to fit on a single 24GB consumer GPU like the RTX 4090, leaving enough room for a 16k-32k context window.
Generation speed in tokens per second (t/s) will vary with your hardware and backend (llama.cpp, vLLM, or MLX). On an RTX 4090 with 4-bit quantization, expect 25-40 t/s. On Apple Silicon (M4 Max), you can expect similar speeds, though performance may degrade as the 256K context window fills up.
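Raw t/s only tells half the story: prefill (reading the prompt) is batched and runs much faster per token than autoregressive decoding. A back-of-envelope latency estimate, with both rates as assumptions in the 4-bit RTX 4090 ballpark:

```python
def response_latency_s(prompt_tokens, output_tokens,
                       prefill_tps=800.0, decode_tps=30.0):
    """Rough end-to-end latency: prompt prefill plus token-by-token decode.
    Both throughput figures are assumed, not measured."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# e.g. a 16K-token codebase prompt with a 500-token answer:
print(f"{response_latency_s(16_000, 500):.0f} s")  # ~37 s at these assumed rates
```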
Ollama is the fastest way to deploy this model. Once installed, you can run:
ollama run gemma4:31b
This will automatically handle the quantization and memory mapping for your specific hardware.
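If you want a larger context window than Ollama's default, the standard route is a Modelfile layered on the same tag. `num_ctx` and `temperature` are standard Modelfile parameters; the values below are just examples:

```
FROM gemma4:31b
# Raise the context window beyond the default (costs extra VRAM for KV cache)
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
```

Build and run it with `ollama create gemma4-long -f Modelfile` followed by `ollama run gemma4-long`.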
In performance comparisons, Gemma 4 31B IT is most often set against Llama 3.1 30B (if using unofficial merges) or Mistral-Large-2 (though Mistral is much larger at 123B).
The 31B IT model occupies the "missing middle." It is vastly more capable than the Llama 3.1 8B in reasoning and multilingual tasks. Compared to the Llama 3.1 70B, Gemma 4 31B IT offers roughly 85-90% of the performance but is significantly easier to run on a single-GPU workstation. If your hardware cannot support the 140GB+ VRAM required for 70B models at high precision, the 31B is the logical choice.
Qwen 2.5 32B is the closest direct competitor in terms of parameter count. While Qwen often leads in pure coding benchmarks and Chinese-language tasks, Gemma 4 31B IT generally offers superior vision capabilities and a more robust instruction-following "feel" for English-centric creative and logic tasks. Gemma's 256K context window also dwarfs the standard deployments of many other models in this size class, making it the preferred choice for long-document analysis.