
A 120B parameter Hybrid Mamba-Transformer model utilizing Latent MoE to provide a 1-million-token context and configurable reasoning modes.
Copy and paste this command to start running the model locally:

`ollama run nemotron-3-super`
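Once the model is pulled, it can also be reached through Ollama's local HTTP API. The sketch below is a minimal example that assumes the default endpoint at `http://localhost:11434` and the `nemotron-3-super` tag from the command above.

```python
# Minimal sketch: query the locally running model through Ollama's HTTP API.
# Assumes Ollama is serving on its default port and that the model tag matches
# the `ollama run` command above.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nemotron-3-super",   # tag from the command above
        "prompt": "Summarize the trade-offs of Mixture-of-Experts models.",
        "stream": False,               # return one JSON object instead of a stream
    },
    timeout=600,
)
print(response.json()["response"])
```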
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 100.9 GB | Low |
| Q4_K_M (Recommended) | 103.5 GB | Good |
| Q5_K_M | 104.7 GB | Very Good |
| Q6_K | 106.1 GB | Excellent |
| Q8_0 | 109.1 GB | Near Perfect |
| FP16 | 120.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Required |
|---|---|---|---|---|
| NVIDIA B200 GPU | NVIDIA | S | 62.2 tok/s | 103.5 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | S | 37.3 tok/s | 103.5 GB |
| SuperMicro Super AI Station | SuperMicro | S | 55.2 tok/s | 103.5 GB |
| Gigabyte W775-V10-L01 | Gigabyte | S | 55.2 tok/s | 103.5 GB |
Nvidia Nemotron 3 Super is a 120-billion parameter Large Language Model (LLM) designed for high-throughput agentic reasoning and complex multi-step workflows. Released by NVIDIA, it occupies a strategic middle ground in the local AI landscape: it provides the reasoning depth of a 120B+ parameter model while utilizing a "Latent MoE" (Mixture of Experts) architecture that only activates 12 billion parameters per token. This allows it to compete directly with massive dense models like GPT-OSS-120B and Qwen3.5-122B while delivering significantly higher tokens-per-second on local hardware.
The model is specifically optimized for "agentic" workloads—tasks where an AI must use tools, call functions, and maintain coherence over extremely long sessions. With a native context window of 1,000,000 tokens, Nemotron 3 Super is built to ingest entire codebases or massive document sets without the "goal drift" often seen in smaller models. For developers running models on their own machines, it represents one of the most efficient ways to achieve frontier-level reasoning without requiring a multi-node server cluster.
The defining characteristic of Nemotron 3 Super is its hybrid architecture. Unlike standard Transformers, it combines Mamba-2 (State Space Model) layers with traditional Attention mechanisms and a Mixture-of-Experts (MoE) routing system.
The use of Latent MoE is a departure from standard MoE implementations. It optimizes for both accuracy per FLOP and accuracy per parameter, ensuring that the 12B active parameters punch well above their weight class. By integrating Mamba-2 layers, the model handles long sequences more efficiently than pure Transformer models, as Mamba's linear scaling reduces the computational overhead of the 1-million-token context window.
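NVIDIA has not published the Latent MoE routing code here, but the sparse-activation principle it builds on can be shown with a conventional top-k MoE layer. The sketch below is a generic illustration (the expert count, hidden size, and top-k value are arbitrary assumptions, not Nemotron's), making concrete how only a fraction of the total parameters is touched per token, which is the same principle that keeps roughly 12B of Nemotron's 120B parameters active.

```python
# Illustrative sketch of a conventional top-k MoE layer (NOT NVIDIA's Latent MoE):
# a router scores every expert per token, but only the top-k experts run,
# so most parameters stay idle for any given token.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 512, 16, 2          # arbitrary illustrative sizes

router_w = rng.normal(size=(d_model, n_experts))             # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) activation for a single token."""
    logits = x @ router_w                        # score each expert
    chosen = np.argsort(logits)[-top_k:]         # keep only the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                     # softmax over the chosen experts
    # Only top_k / n_experts of the expert parameters are used for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)                  # (512,)
```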
Nemotron 3 Super includes native Multi-Token Prediction (MTP) layers. For local practitioners, this is a major advantage: it enables built-in speculative decoding. This allows the model to propose multiple tokens in a single forward pass, resulting in a 2.2x to 7.5x throughput increase compared to similarly sized dense models when running on NVIDIA hardware.
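The acceptance logic behind speculative decoding can be illustrated with a generic draft-and-verify loop. The sketch below is a simplified greedy variant, not NVIDIA's MTP implementation: a cheap draft step proposes k tokens, the full model checks them in one pass, and the longest agreeing prefix is kept.

```python
# Illustrative sketch of the draft-and-verify loop behind speculative decoding
# (a simplified greedy scheme, not NVIDIA's exact MTP implementation).
from typing import Callable, List

def speculative_step(
    draft: Callable[[List[int], int], List[int]],         # proposes k candidate tokens
    verify: Callable[[List[int], List[int]], List[int]],   # full model's tokens at the same positions
    context: List[int],
    k: int = 4,
) -> List[int]:
    proposed = draft(context, k)
    checked = verify(context, proposed)
    accepted: List[int] = []
    for p, c in zip(proposed, checked):
        if p != c:
            accepted.append(c)   # keep the full model's correction, then stop
            break
        accepted.append(p)       # draft token confirmed by the full model
    return context + accepted

# Toy usage: a "draft" that always guesses token 1 and a "verifier" that agrees
# for the first two positions only.
ctx = speculative_step(lambda c, k: [1] * k, lambda c, p: [1, 1, 2, 3], [0])
print(ctx)  # [0, 1, 1, 2] -> two drafts accepted, third corrected
```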
Nemotron 3 Super is a text-only model tuned for high-logic tasks. It is not a general-purpose "creative writing" model but rather a functional tool for engineers and researchers.
The model is specifically post-trained for agentic workflows using NVIDIA’s Nemotron-post-training-v3 datasets. It excels at tool calling, function execution, and maintaining coherence across long multi-step sessions.
With its 120B scale, the model shows high proficiency in Python, C++, and CUDA programming. Because it was pre-trained in NVFP4 (NVIDIA's 4-bit floating point format), the base weights are already optimized for high-precision logic even at lower bit widths. It is a primary candidate for local RAG (Retrieval-Augmented Generation) over large repositories, where the 1M context window allows you to skip complex chunking strategies and simply feed the model the relevant files.
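One way to exploit the long context for repository-scale RAG is to skip chunking entirely and paste whole files into the prompt. The sketch below reuses the assumed Ollama endpoint and model tag from earlier; the file paths are placeholders for your own project.

```python
# Minimal sketch: long-context "RAG without chunking" by concatenating whole
# source files into a single prompt. Assumes the Ollama endpoint and model tag
# used earlier; the file paths are placeholders, not part of this model.
from pathlib import Path
import requests

files = [Path("src/main.py"), Path("src/utils.py")]   # placeholder paths
corpus = "\n\n".join(f"### {p}\n{p.read_text()}" for p in files)

prompt = (
    "You are reviewing the following repository files.\n\n"
    f"{corpus}\n\n"
    "Question: where is the configuration loaded, and what defaults apply?"
)

reply = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "nemotron-3-super", "prompt": prompt, "stream": False},
    timeout=600,
).json()["response"]
print(reply)
```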
Running a 120B model locally is a significant hardware undertaking. Even though only 12B parameters are active during inference, the entire 120B parameter set must reside in VRAM (or system RAM for GGUF/Mac users) to avoid massive "offloading" bottlenecks.
To run Nemotron 3 Super, your primary constraint is the footprint of the weights.
| Quantization | Recommended VRAM | Hardware Target |
| :--- | :--- | :--- |
| BF16 (Unquantized) | ~240 GB | 3x A100 (80GB) or 4x H100 |
| FP8 / NVFP4 | ~130 GB | 2x A100 (80GB) or Mac Studio M2/M4 Ultra (192GB) |
| Q4_K_M (GGUF) | ~75-80 GB | 2x RTX 6000 Ada or Mac Studio (128GB+) |
| Q3_K_L (GGUF) | ~60-65 GB | 3x RTX 3090/4090 (24GB) via NVLink/P2P |
For most practitioners, Q4_K_M is the "sweet spot." It preserves nearly all the reasoning capabilities of the BF16 original while fitting into the ~80GB VRAM pool common in professional workstations.
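As a sanity check before downloading anything, the weight footprint can be estimated from the parameter count and bits per weight. The bits-per-weight values in the sketch below are approximate community figures for these formats, not official numbers, and real files add per-tensor overhead plus KV-cache memory for long contexts.

```python
# Back-of-envelope footprint: parameters x bits-per-weight / 8 bits per byte.
# The bits-per-weight values are rough assumptions, not official figures;
# real files add per-tensor metadata, and the KV cache for long contexts
# needs additional memory on top of the weights.
PARAMS = 120e9
BITS_PER_WEIGHT = {"Q3_K_L": 4.0, "Q4_K_M": 4.85, "FP8": 8.0, "BF16": 16.0}

for fmt, bpw in BITS_PER_WEIGHT.items():
    gigabytes = PARAMS * bpw / 8 / 1e9
    print(f"{fmt:>7}: ~{gigabytes:.0f} GB for the weights alone")
```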
If you are trying to run a 120B model on consumer GPUs, you will need at least three RTX 3090 or 4090 cards. Using Ollama or llama.cpp with 4-bit quantization will allow you to split the model across these cards. On Apple Silicon, an M2 or M3 Ultra with at least 128GB of Unified Memory is the ideal environment for this model, providing enough headroom for the 1M token context window.
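If you take the llama.cpp route on a multi-GPU box, the split across cards is configured when the model is loaded. The sketch below uses the llama-cpp-python bindings with an assumed GGUF filename and an even three-way tensor split.

```python
# Sketch: loading a Q4_K_M GGUF across three consumer GPUs with llama-cpp-python.
# The model filename and context size are assumptions for illustration;
# tensor_split controls how the weights are divided across the cards.
from llama_cpp import Llama

llm = Llama(
    model_path="nemotron-3-super.Q4_K_M.gguf",   # hypothetical local filename
    n_gpu_layers=-1,                             # offload every layer to GPU
    tensor_split=[0.34, 0.33, 0.33],             # rough even split across 3 cards
    n_ctx=32768,                                 # far below 1M; sized for VRAM headroom
)

out = llm("Explain what a hybrid Mamba-Transformer layer stack is.", max_tokens=256)
print(out["choices"][0]["text"])
```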
On a dual A100 setup using FP8 quantization, expect high throughput (50+ tokens/sec) thanks to the MTP layers. On consumer 4090 clusters using GGUF, performance will likely settle between 5 and 15 tokens/sec depending on interconnect speed (PCIe Gen4 vs. Gen5).
Nemotron 3 Super enters a competitive field of "large-but-efficient" models.
Qwen 3.5 122B is a dense model, meaning every parameter is active for every token. While Qwen may offer slightly higher raw knowledge density, Nemotron 3 Super is significantly faster in local inference. In NVIDIA's own benchmarks, Nemotron achieves up to 7.5x higher throughput in long-context scenarios compared to Qwen. If your application involves high-volume token generation (like autonomous agents), Nemotron is the clear winner.
DeepSeek-V3 is a much larger MoE (671B total parameters). While DeepSeek-V3 is arguably the current state-of-the-art for open-weights reasoning, its VRAM requirements are astronomical compared to Nemotron. Nemotron 3 Super provides "frontier-adjacent" reasoning while remaining small enough to fit on a single high-end workstation or a small Mac Studio, whereas DeepSeek-V3 requires a dedicated server rack.
Choose this model if you need a "reasoning backbone" for a local agentic system where speed and context length are more important than world-knowledge trivia. It is the most technically advanced hybrid model currently available for local deployment that can realistically handle a million-token context window without collapsing into gibberish.