
Google's most capable open dense model from the Gemma 4 family, built from Gemini 3 research. 31B parameters with 256K context, configurable thinking modes, and multimodal support (text, image, video). Ranks #3 among open models on the Arena AI text leaderboard with an estimated score of 1452. Fits on a single 80GB GPU in bfloat16.
Copy and paste this command to start running the model locally:

ollama run gemma4:31b

Access model weights, configuration files, and documentation.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | Notes |
|---|---|---|---|
| Q2_K | 75.5 GB | Low | Aggressive quantization; smallest size, noticeable quality loss |
| Q4_K_M (Recommended) | 82.0 GB | Good | Best balance of size and quality for most use cases |
| Q5_K_M | 85.1 GB | Very Good | Slightly better quality than Q4 with a moderate size increase |
| Q6_K | 88.8 GB | Excellent | Near-lossless quality with manageable size |
| Q8_0 | 96.5 GB | Near Perfect | Virtually indistinguishable from full precision |
| FP16 | 126.0 GB | Full | Full 16-bit floating point; maximum quality, largest size |
See which devices can run this model and at what quality level.

| Device | Tier | Speed | VRAM Used |
|---|---|---|---|
| NVIDIA H200 SXM 141GB | SS | 47.1 tok/s | 82.0 GB |
| NVIDIA B200 GPU | SS | 78.6 tok/s | 82.0 GB |
| Google Cloud TPU v5p | AA | 27.2 tok/s | 82.0 GB |
| NVIDIA H100 SXM5 80GB | BB | 32.9 tok/s | 82.0 GB |
| NVIDIA A100 SXM4 80GB | CC | 20.0 tok/s | 82.0 GB |
Gemma 4 31B IT represents a significant shift in Google’s open-model strategy, bridging the gap between lightweight edge models and massive enterprise-grade LLMs. Built on the research foundations of Gemini 3, this is a dense 31B parameter model designed for high-reasoning tasks and complex multimodal workflows. Unlike previous iterations that focused heavily on text-only efficiency, the 31B IT variant is a native text-and-vision model, capable of processing images and video alongside text inputs.
Ranked #3 on the Arena AI text leaderboard with an estimated score of 1452, it competes directly with models twice its size. For local practitioners, the 31B parameter count represents a "sweet spot": it is small enough to fit on high-end consumer hardware with quantization, yet large enough to exhibit the sophisticated reasoning and instruction-following capabilities typically reserved for 70B+ models. It is engineered for developers who need a reliable local model that can handle function calling and long-context retrieval without the latency of a cloud API.
Gemma 4 31B IT uses a dense transformer architecture. While many competitors are moving toward Mixture-of-Experts (MoE) to reduce compute costs, Google has opted for a dense 31B parameter structure here to maximize per-parameter intelligence and reasoning stability. This results in a more predictable memory footprint and consistent performance across diverse prompts.
One of the standout technical specs is the 256,000 token context length. This massive context window allows for the ingestion of entire codebases, long technical manuals, or hour-long video files. To support this, the model utilizes advanced attention mechanisms that maintain high retrieval accuracy (Needle In A Haystack performance) even at the upper limits of the window.
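The memory cost of that window is worth quantifying. Google has not published the 31B model's full attention dimensions, so the layer count, KV-head count, and head size below are illustrative assumptions; the arithmetic itself (two cached tensors per layer per token) is general:

```python
def kv_cache_gb(tokens, n_layers=48, n_kv_heads=8, head_dim=128, bytes_per=2):
    """Estimate KV-cache size: two tensors (K and V) per layer, each
    [n_kv_heads, head_dim] per token, at the given precision.
    The architecture numbers here are assumed, not published."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per * tokens / 1024**3

# A fully used 256K-token context is substantial even with grouped-query attention:
print(f"{kv_cache_gb(256_000):.1f} GB")  # roughly 47 GB with these assumed dims
```

Even under grouped-query attention, a full 256K window adds tens of gigabytes on top of the weights, which helps explain why total VRAM figures run well above the weights-only numbers.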
The model is natively multimodal. The vision encoder is integrated directly into the architecture, allowing it to "see" and reason about visual data without needing a separate adapter. It also features configurable thinking modes, which allow the model to allocate more compute to internal reasoning before generating a final response—a crucial feature for complex math and logic-heavy tasks.
Gemma 4 31B IT is a generalist with specialized performance in high-logic domains. Its instruction-following is precise, making it ideal for structured data extraction and complex function-calling where smaller models often fail to adhere to JSON schemas.
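A minimal sketch of what "adhering to a JSON schema" means in practice once the model is wired into a tool loop. The tool definition and field names here are hypothetical; the point is to validate the model's raw output before executing anything:

```python
import json

# Hypothetical tool schema: the argument names and types a function call must follow.
WEATHER_TOOL = {"name": "get_weather", "args": {"city": str, "unit": str}}

def parse_tool_call(raw: str, tool: dict) -> dict:
    """Parse a model's JSON function call and check it against the schema.
    Raises ValueError on a wrong tool name or bad/missing arguments."""
    call = json.loads(raw)
    if call.get("name") != tool["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')}")
    args = call.get("arguments", {})
    for key, typ in tool["args"].items():
        if not isinstance(args.get(key), typ):
            raise ValueError(f"bad or missing argument: {key}")
    return args

# A well-formed call passes; malformed output fails loudly instead of silently.
ok = parse_tool_call(
    '{"name": "get_weather", "arguments": {"city": "Oslo", "unit": "C"}}',
    WEATHER_TOOL,
)
print(ok["city"])  # Oslo
```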
For coding, Gemma 4 31B IT excels at multi-file refactoring and bug hunting. Because it can hold 256K tokens in memory, you can feed it multiple source files so it can understand architectural dependencies. It supports dozens of programming languages, with particular strength in Python, Rust, and C++. Practitioners use it for local code completion and for generating unit tests where data privacy rules out hosted solutions.
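A common pattern for exploiting the window is to pack several files into one prompt. This sketch shows the idea; the path-header format and the character budget (roughly sized for a 256K-token window at ~4 characters per token) are arbitrary choices, not a prescribed format:

```python
from pathlib import Path

def pack_repo(paths, budget_chars=800_000):
    """Concatenate source files into one prompt, each prefixed with its
    path, stopping before a rough character budget is exceeded."""
    parts, used = [], 0
    for p in paths:
        text = Path(p).read_text(encoding="utf-8", errors="replace")
        chunk = f"### FILE: {p}\n{text}\n"
        if used + len(chunk) > budget_chars:
            break  # budget reached; remaining files are left out
        parts.append(chunk)
        used += len(chunk)
    return "".join(parts)
```

The resulting string can be prepended to a refactoring or bug-hunt instruction and sent as a single prompt.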
The vision capabilities extend beyond simple OCR. It can analyze architectural diagrams, interpret complex data visualizations, and perform temporal reasoning on video files. This makes it a powerful tool for automated video summarization or visual debugging of UI/UX layouts.
Its reasoning benchmark scores are bolstered by its ability to handle chain-of-thought processing in multiple languages. It is particularly effective at solving graduate-level math problems and symbolic logic puzzles, making it a viable candidate for local RAG (Retrieval-Augmented Generation) systems that require high synthesis capability.
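To make the RAG idea concrete, here is a deliberately simple retriever that ranks chunks by word overlap with the query. A real pipeline would use embeddings, but the shape of the loop (score, rank, stuff the winners into the prompt) is the same:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank documents by word overlap with the query -- a stand-in for
    the embedding search a real local RAG pipeline would use."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)
    return scored[:k]

docs = [
    "Gemma 4 31B IT supports a 256K context window.",
    "The vision encoder is integrated into the architecture.",
    "Q4_K_M quantization fits on a 24GB GPU.",
]
top = retrieve("what context window does the model support", docs, k=1)
print(top[0])  # the context-window document scores highest
```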
To run Gemma 4 31B IT locally, your primary constraint will be VRAM. Because this is a dense 31B model, it requires significantly more memory than the Gemma 2 9B or Llama 3 8B, but it is much more accessible than 70B+ models.
The model in its native bfloat16 precision requires approximately 62GB of VRAM just for the weights, plus additional overhead for the KV cache. To run this comfortably, you generally have two paths: professional GPUs or aggressive quantization on consumer hardware.
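The weights-only arithmetic is straightforward: parameter count times bits per weight. The bits-per-weight values below are approximate averages for llama.cpp-style quants, so treat the outputs as estimates:

```python
PARAMS = 31e9  # 31B dense parameters

def weights_gb(bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal GB
    (excludes KV cache and activation overhead)."""
    return PARAMS * bits_per_weight / 8 / 1e9

# Approximate effective bits per weight for common formats (assumed averages):
for name, bits in [("BF16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.85), ("Q2_K", 2.8)]:
    print(f"{name:8s} {weights_gb(bits):5.1f} GB")
```

This reproduces the ballpark figures in the table below: ~62 GB at BF16 down to ~11 GB at Q2_K.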
Choosing the best quantization for Gemma 4 31B IT involves balancing perplexity (intelligence) against speed.
| Quantization | VRAM Required (Weights Only) | Recommended Hardware |
| :--- | :--- | :--- |
| FP16 / BF16 | ~62 GB | A100 80GB, Mac 96GB+ |
| Q8_0 | ~33 GB | RTX 6000 Ada, Mac 64GB |
| Q4_K_M | ~19 GB | RTX 3090/4090 24GB |
| Q2_K | ~11 GB | RTX 3060 12GB / 4070 Ti Super 16GB |
For most practitioners, Q4_K_M is the "goldilocks" quantization. It retains nearly all the model's intelligence while allowing it to fit on a single 24GB consumer GPU like the RTX 4090, leaving enough room for a 16k-32k context window.
Generation speed in tokens per second (t/s) will vary with your hardware and backend (llama.cpp, vLLM, or MLX). On an RTX 4090 with 4-bit quantization, expect 25-40 t/s. On Apple Silicon (M4 Max), you can expect similar speeds, though performance may degrade as the 256K context window fills up.
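Raw t/s only tells half the story: prefill (reading the prompt) is batched and runs much faster per token than autoregressive decoding. A back-of-envelope latency estimate, with both rates as assumptions in the 4-bit RTX 4090 ballpark:

```python
def response_latency_s(prompt_tokens, output_tokens,
                       prefill_tps=800.0, decode_tps=30.0):
    """Rough end-to-end latency: prompt prefill plus token-by-token decode.
    Both throughput figures are assumed, not measured."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# e.g. a 16K-token codebase prompt with a 500-token answer:
print(f"{response_latency_s(16_000, 500):.0f} s")  # ~37 s at these assumed rates
```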
Ollama is the fastest way to deploy this model. Once installed, you can run:
ollama run gemma4:31b
This will automatically handle the quantization and memory mapping for your specific hardware.
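If you want a larger context window than Ollama's default, the standard route is a Modelfile layered on the same tag. `num_ctx` and `temperature` are standard Modelfile parameters; the values below are just examples:

```
FROM gemma4:31b
# Raise the context window beyond the default (costs extra VRAM for KV cache)
PARAMETER num_ctx 32768
PARAMETER temperature 0.7
```

Build and run it with `ollama create gemma4-long -f Modelfile` followed by `ollama run gemma4-long`.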
In performance comparisons, Gemma 4 31B IT is most often set against Llama 3.1 30B (if using unofficial merges) or Mistral-Large-2 (though Mistral is much larger at 123B).
The 31B IT model occupies the "missing middle." It is vastly more capable than the Llama 3.1 8B in reasoning and multilingual tasks. Compared to the Llama 3.1 70B, Gemma 4 31B IT offers roughly 85-90% of the performance but is significantly easier to run on a single-GPU workstation. If your hardware cannot support the 140GB+ VRAM required for 70B models at high precision, the 31B is the logical choice.
Qwen 2.5 32B is the closest direct competitor in terms of parameter count. While Qwen often leads in pure coding benchmarks and Chinese-language tasks, Gemma 4 31B IT generally offers superior vision capabilities and a more robust instruction-following "feel" for English-centric creative and logic tasks. Gemma's 256K context window also dwarfs the standard deployments of many other models in this size class, making it the preferred choice for long-document analysis.