
A 120B parameter Hybrid Mamba-Transformer model utilizing Latent MoE to provide a 1-million-token context and configurable reasoning modes.
Copy and paste this command to start running the model locally:

`ollama run nemotron-3-super`
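Once the model is pulled, it can also be reached through Ollama's local HTTP API. The sketch below is a minimal example that assumes the default endpoint at `http://localhost:11434` and the `nemotron-3-super` tag from the command above.

```python
# Minimal sketch: query the locally running model through Ollama's HTTP API.
# Assumes Ollama is serving on its default port and that the model tag matches
# the `ollama run` command above.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "nemotron-3-super",   # tag from the command above
        "prompt": "Summarize the trade-offs of Mixture-of-Experts models.",
        "stream": False,               # return one JSON object instead of a stream
    },
    timeout=600,
)
print(response.json()["response"])
```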
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 100.9 GB | Low |
| Q4_K_M (Recommended) | 103.5 GB | Good |
| Q5_K_M | 104.7 GB | Very Good |
| Q6_K | 106.1 GB | Excellent |
| Q8_0 | 109.1 GB | Near Perfect |
| FP16 | 120.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Vendor | Tier | Speed | VRAM Required |
|---|---|---|---|---|
| NVIDIA B200 GPU | NVIDIA | S | 62.2 tok/s | 103.5 GB |
| NVIDIA H200 SXM 141GB | NVIDIA | S | 37.3 tok/s | 103.5 GB |
| SuperMicro Super AI Station | SuperMicro | S | 55.2 tok/s | 103.5 GB |
| Gigabyte W775-V10-L01 | Gigabyte | S | 55.2 tok/s | 103.5 GB |
Nvidia Nemotron 3 Super is a 120-billion parameter Large Language Model (LLM) designed for high-throughput agentic reasoning and complex multi-step workflows. Released by NVIDIA, it occupies a strategic middle ground in the local AI landscape: it provides the reasoning depth of a 120B+ parameter model while utilizing a "Latent MoE" (Mixture of Experts) architecture that only activates 12 billion parameters per token. This allows it to compete directly with massive dense models like GPT-OSS-120B and Qwen3.5-122B while delivering significantly higher tokens-per-second on local hardware.
The model is specifically optimized for "agentic" workloads—tasks where an AI must use tools, call functions, and maintain coherence over extremely long sessions. With a native context window of 1,000,000 tokens, Nemotron 3 Super is built to ingest entire codebases or massive document sets without the "goal drift" often seen in smaller models. For developers running models on their own machines, it represents one of the most efficient ways to achieve frontier-level reasoning without requiring a multi-node server cluster.
The defining characteristic of Nemotron 3 Super is its hybrid architecture. Unlike standard Transformers, it combines Mamba-2 (State Space Model) layers with traditional Attention mechanisms and a Mixture-of-Experts (MoE) routing system.
The use of Latent MoE is a departure from standard MoE implementations. It optimizes for both accuracy per FLOP and accuracy per parameter, ensuring that the 12B active parameters punch well above their weight class. By integrating Mamba-2 layers, the model handles long sequences more efficiently than pure Transformer models, as Mamba's linear scaling reduces the computational overhead of the 1-million-token context window.
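NVIDIA has not published the Latent MoE routing code here, but the sparse-activation principle it builds on can be shown with a conventional top-k MoE layer. The sketch below is a generic illustration (the expert count, hidden size, and top-k value are arbitrary assumptions, not Nemotron's), making concrete how only a fraction of the total parameters is touched per token, which is the same principle that keeps roughly 12B of Nemotron's 120B parameters active.

```python
# Illustrative sketch of a conventional top-k MoE layer (NOT NVIDIA's Latent MoE):
# a router scores every expert per token, but only the top-k experts run,
# so most parameters stay idle for any given token.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 512, 16, 2          # arbitrary illustrative sizes

router_w = rng.normal(size=(d_model, n_experts))             # router projection
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) activation for a single token."""
    logits = x @ router_w                        # score each expert
    chosen = np.argsort(logits)[-top_k:]         # keep only the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                     # softmax over the chosen experts
    # Only top_k / n_experts of the expert parameters are used for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.normal(size=d_model)
print(moe_forward(token).shape)                  # (512,)
```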
Nemotron 3 Super includes native Multi-Token Prediction (MTP) layers. For local practitioners, this is a major advantage: it enables built-in speculative decoding. This allows the model to propose multiple tokens in a single forward pass, resulting in a 2.2x to 7.5x throughput increase compared to similarly sized dense models when running on NVIDIA hardware.
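The acceptance logic behind speculative decoding can be illustrated with a generic draft-and-verify loop. The sketch below is a simplified greedy variant, not NVIDIA's MTP implementation: a cheap draft step proposes k tokens, the full model checks them in one pass, and the longest agreeing prefix is kept.

```python
# Illustrative sketch of the draft-and-verify loop behind speculative decoding
# (a simplified greedy scheme, not NVIDIA's exact MTP implementation).
from typing import Callable, List

def speculative_step(
    draft: Callable[[List[int], int], List[int]],         # proposes k candidate tokens
    verify: Callable[[List[int], List[int]], List[int]],   # full model's tokens at the same positions
    context: List[int],
    k: int = 4,
) -> List[int]:
    proposed = draft(context, k)
    checked = verify(context, proposed)
    accepted: List[int] = []
    for p, c in zip(proposed, checked):
        if p != c:
            accepted.append(c)   # keep the full model's correction, then stop
            break
        accepted.append(p)       # draft token confirmed by the full model
    return context + accepted

# Toy usage: a "draft" that always guesses token 1 and a "verifier" that agrees
# for the first two positions only.
ctx = speculative_step(lambda c, k: [1] * k, lambda c, p: [1, 1, 2, 3], [0])
print(ctx)  # [0, 1, 1, 2] -> two drafts accepted, third corrected
```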
Nemotron 3 Super is a text-only model tuned for high-logic tasks. It is not a general-purpose "creative writing" model but rather a functional tool for engineers and researchers.
The model is specifically post-trained for agentic workflows using NVIDIA’s Nemotron-post-training-v3 datasets. It excels at tool calling, function execution, and maintaining coherence across long multi-step sessions.
With its 120B scale, the model shows high proficiency in Python, C++, and CUDA programming. Because it was pre-trained in NVFP4 (NVIDIA's 4-bit floating point format), the base weights are already optimized for high-precision logic even at lower bit widths. It is a primary candidate for local RAG (Retrieval-Augmented Generation) over large repositories, where the 1M context window allows you to skip complex chunking strategies and simply feed the model the relevant files.
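One way to exploit the long context for repository-scale RAG is to skip chunking entirely and paste whole files into the prompt. The sketch below reuses the assumed Ollama endpoint and model tag from earlier; the file paths are placeholders for your own project.

```python
# Minimal sketch: long-context "RAG without chunking" by concatenating whole
# source files into a single prompt. Assumes the Ollama endpoint and model tag
# used earlier; the file paths are placeholders, not part of this model.
from pathlib import Path
import requests

files = [Path("src/main.py"), Path("src/utils.py")]   # placeholder paths
corpus = "\n\n".join(f"### {p}\n{p.read_text()}" for p in files)

prompt = (
    "You are reviewing the following repository files.\n\n"
    f"{corpus}\n\n"
    "Question: where is the configuration loaded, and what defaults apply?"
)

reply = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "nemotron-3-super", "prompt": prompt, "stream": False},
    timeout=600,
).json()["response"]
print(reply)
```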
Running a 120B model locally is a significant hardware undertaking. Even though only 12B parameters are active during inference, the entire 120B parameter set must reside in VRAM (or system RAM for GGUF/Mac users) to avoid massive "offloading" bottlenecks.
To run Nemotron 3 Super, your primary constraint is the footprint of the weights.
| Quantization | Recommended VRAM | Hardware Target |
| :--- | :--- | :--- |
| BF16 (Unquantized) | ~240 GB | 3x A100 (80GB) or 4x H100 |
| FP8 / NVFP4 | ~130 GB | 2x A100 (80GB) or Mac Studio M2/M4 Ultra (192GB) |
| Q4_K_M (GGUF) | ~75-80 GB | 2x RTX 6000 Ada or Mac Studio (128GB+) |
| Q3_K_L (GGUF) | ~60-65 GB | 3x RTX 3090/4090 (24GB) via NVLink/P2P |
For most practitioners, Q4_K_M is the "sweet spot." It preserves nearly all the reasoning capabilities of the BF16 original while fitting into the ~80GB VRAM pool common in professional workstations.
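As a sanity check before downloading anything, the weight footprint can be estimated from the parameter count and bits per weight. The bits-per-weight values in the sketch below are approximate community figures for these formats, not official numbers, and real files add per-tensor overhead plus KV-cache memory for long contexts.

```python
# Back-of-envelope footprint: parameters x bits-per-weight / 8 bits per byte.
# The bits-per-weight values are rough assumptions, not official figures;
# real files add per-tensor metadata, and the KV cache for long contexts
# needs additional memory on top of the weights.
PARAMS = 120e9
BITS_PER_WEIGHT = {"Q3_K_L": 4.0, "Q4_K_M": 4.85, "FP8": 8.0, "BF16": 16.0}

for fmt, bpw in BITS_PER_WEIGHT.items():
    gigabytes = PARAMS * bpw / 8 / 1e9
    print(f"{fmt:>7}: ~{gigabytes:.0f} GB for the weights alone")
```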
If you are trying to run a 120B model on consumer GPUs, you will need at least three RTX 3090 or 4090 cards. Using Ollama or llama.cpp with 4-bit quantization will allow you to split the model across these cards. On Apple Silicon, an M2 or M3 Ultra with at least 128GB of Unified Memory is the ideal environment for this model, providing enough headroom for the 1M token context window.
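If you take the llama.cpp route on a multi-GPU box, the split across cards is configured when the model is loaded. The sketch below uses the llama-cpp-python bindings with an assumed GGUF filename and an even three-way tensor split.

```python
# Sketch: loading a Q4_K_M GGUF across three consumer GPUs with llama-cpp-python.
# The model filename and context size are assumptions for illustration;
# tensor_split controls how the weights are divided across the cards.
from llama_cpp import Llama

llm = Llama(
    model_path="nemotron-3-super.Q4_K_M.gguf",   # hypothetical local filename
    n_gpu_layers=-1,                             # offload every layer to GPU
    tensor_split=[0.34, 0.33, 0.33],             # rough even split across 3 cards
    n_ctx=32768,                                 # far below 1M; sized for VRAM headroom
)

out = llm("Explain what a hybrid Mamba-Transformer layer stack is.", max_tokens=256)
print(out["choices"][0]["text"])
```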
On a dual A100 setup using FP8 quantization, expect high throughput (50+ tokens/sec) thanks to the MTP layers. On consumer 4090 clusters using GGUF, performance will likely settle between 5 and 15 tokens/sec depending on interconnect speed (PCIe Gen4 vs. Gen5).
Nemotron 3 Super enters a competitive field of "large-but-efficient" models.
Qwen 3.5 122B is a dense model, meaning every parameter is active for every token. While Qwen may offer slightly higher raw knowledge density, Nemotron 3 Super is significantly faster in local inference. In NVIDIA's own benchmarks, Nemotron achieves up to 7.5x higher throughput in long-context scenarios compared to Qwen. If your application involves high-volume token generation (like autonomous agents), Nemotron is the clear winner.
DeepSeek-V3 is a much larger MoE (671B total parameters). While DeepSeek-V3 is arguably the current state-of-the-art for open-weights reasoning, its VRAM requirements are astronomical compared to Nemotron. Nemotron 3 Super provides "frontier-adjacent" reasoning while remaining small enough to fit on a single high-end workstation or a small Mac Studio, whereas DeepSeek-V3 requires a dedicated server rack.
Choose this model if you need a "reasoning backbone" for a local agentic system where speed and context length are more important than world-knowledge trivia. It is the most technically advanced hybrid model currently available for local deployment that can realistically handle a million-token context window without collapsing into gibberish.