NVIDIA

Nemotron 3 Ultra

NVIDIA's largest open-weight Nemotron model, a Latent Mixture-of-Experts design with 550B total parameters and 55B active per token. It uses a hybrid architecture that interleaves Mamba-2, MoE, and select attention layers, adds Multi-Token Prediction for faster generation, and supports a 1M-token context window. It is text-only, was pre-trained on data through September 2025 with post-training through May 2026, and ships under the OpenMDW License. On benchmarks it scores 86.8 on MMLU-Pro, 70.7 on SWE-Bench Verified, 89.0 on LiveCodeBench v6, and 94.7 on RULER at 1M context, with an Artificial Analysis Intelligence Index of 47.7.

550B paramsMoE1000K ctx

View on Hugging Face Official Page

Our Take

Best for: Strongest at graduate-level reasoning (GPQA) in its size class

A workable 550B-parameter MoE language model from NVIDIA. Pulls ahead on graduate-level reasoning (GPQA) (87/100), so reach for it when that's the dimension that matters. Newly released, so production-readiness is still being shaken out.

Run this onASUS ExpertCenter Pro ET900N G3Cheapest card in our directory with comfortable headroom (748 GB) for this model at Q4 (~472.4 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters550B

Active Params55B

ArchitectureMoE

Context Length1M tokens

ModalityText Only

Training Cutoff2025-09

ProviderNVIDIA

Download Size2.2 TB

Community

Monthly Downloads129.3K

Likes246

Last Updated16 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

OpenMDW License Agreement v1.1View Full License

Performance & Scoring

Benchmarks

87.0

86.8

71.9

26.7

MBA Open Score

47.5CC

Benchmark40%

68.1

Popularity25%

31.1

Efficiency20%

4.2

Versatility15%

77.5

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	460.8 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	472.4 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	477.9 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	484.5 GB	Excellent	Near-lossless quality with manageable size
Q8_0	498.2 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	550.5 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ASUS ExpertCenter Pro ET900N G3ASUS	AA	12.1 tok/s	472.4 GB
Dell Pro Max with GB300Dell	AA	12.1 tok/s	472.4 GB
Gigabyte W775-V10-L01Gigabyte	AA	12.1 tok/s	472.4 GB
HP ZGX Fury AI StationHP	AA	12.1 tok/s	472.4 GB
MSI XpertStation WS300MSI	AA	12.1 tok/s	472.4 GB
SuperMicro Super AI StationSuperMicro	AA	12.1 tok/s	472.4 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	CC	1.4 tok/s	472.4 GB
Apple Mac Studio (M3 Ultra, 2025)Apple	CC	1.4 tok/s	472.4 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	FF	0.9 tok/s	472.4 GB
Acer Veriton GN100 AI MiniAcer	FF	0.5 tok/s	472.4 GB
AMD Instinct MI300XAMD	FF	9.0 tok/s	472.4 GB
AMD Instinct MI325XAMD	FF	10.2 tok/s	472.4 GB
AMD Instinct MI355XAMD	FF	13.6 tok/s	472.4 GB
AMD Radeon RX 7600 8GBAMD	FF	0.5 tok/s	472.4 GB
AMD Radeon RX 7700 XTAMD	FF	0.7 tok/s	472.4 GB
AMD Radeon RX 7800 XTAMD	FF	1.1 tok/s	472.4 GB
AMD Radeon RX 7900 XTAMD	FF	1.4 tok/s	472.4 GB
AMD Radeon RX 7900 XTXAMD	FF	1.6 tok/s	472.4 GB
AMD Radeon RX 9070AMD	FF	1.1 tok/s	472.4 GB
AMD Radeon RX 9070 XTAMD	FF	1.1 tok/s	472.4 GB
Apple M4Apple	FF	0.2 tok/s	472.4 GB
Apple M4 Max (40-core GPU)Apple	FF	0.9 tok/s	472.4 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	FF	0.5 tok/s	472.4 GB
Apple M5Apple	FF	0.3 tok/s	472.4 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	FF	1.0 tok/s	472.4 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on Apple M3 Ultra (32-core CPU, 80-core GPU) (~1.4 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Nemotron 3 Ultra on Apple M3 Ultra (32-core CPU, 80-core GPU) · ~1.4 tok/s · 160W	$3.82
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 472 GB VRAM, refreshed hourly.

No current rental listing covers this model’s VRAM requirement on the providers we track.

About This Model

Overview

Nemotron 3 Ultra is NVIDIA’s largest open-weight language model to date — a 550B-parameter Mixture-of-Experts (MoE) design that activates only 55B parameters per token. It is the final and most capable model in the Nemotron 3 family, built for practitioners who need frontier-level reasoning, long-context analysis, and agentic workflows without relying on cloud APIs.

What sets Nemotron 3 Ultra apart is its hybrid architecture: it interleaves Mamba-2 state-space layers, MoE feed-forward blocks, and selective attention layers. This combination reduces the KV cache footprint and attention cost, yielding up to 5.9× higher inference throughput than comparable open models like GLM-5.1-754B-A40B (8K input / 64K output setting). The model was pre-trained on 20 trillion tokens (cutoff September 2025) and post-trained through May 2026 using SFT, reinforcement learning, and Multi-teacher On-Policy Distillation (MOPD).

Nemotron 3 Ultra competes directly with other large MoE models such as Qwen-3.5-397B-A17B and Kimi-K2.6-1T-A32B. It scores 86.8 on MMLU-Pro, 70.7 on SWE-Bench Verified, 89.0 on LiveCodeBench v6, and 94.7 on RULER at 1M context — numbers that put it at the top of the open-weight leaderboard for both reasoning and long-context tasks. Its Artificial Analysis Intelligence Index of 47.7 reflects strong overall capability across diverse benchmarks.

Architecture & Technical Details

Nemotron 3 Ultra uses a LatentMoE design — a variant of Mixture-of-Experts where the router learns a compressed latent representation before selecting experts. This improves accuracy per active parameter compared to standard MoE. The hybrid backbone mixes:

Mamba-2 layers – state-space models that scale linearly with sequence length, reducing memory and compute for long contexts.
Standard attention layers – retained where full quadratic attention is beneficial for tasks requiring precise token interactions.
MoE feed-forward blocks – only 55B of the 550B total parameters are active per token, meaning inference speed is closer to a 55B dense model than a 550B one.

The model also includes Multi-Token Prediction (MTP) layers, which act as native speculative decoding by predicting multiple future tokens in parallel. This directly boosts tokens-per-second during autoregressive generation without requiring a separate draft model.

Context length is 1,000,000 tokens — achieved through long-context extension after pre-training. At 1M tokens, Nemotron 3 Ultra outperforms all other open LLMs on the RULER benchmark, making it viable for full-document analysis, massive codebases, or long-running agentic memory.

Precision: The model was pre-trained in NVFP4 (NVIDIA’s 4-bit floating point), and the released checkpoints include BF16 and NVFP4 quantized versions. NVFP4 quantized weights reduce memory footprint by roughly 4× versus BF16 while maintaining near-lossless accuracy on most benchmarks.

Capabilities & Use Cases

Nemotron 3 Ultra is a text-only model with a broad skill set: chat, code generation, reasoning, function-calling, multilingual support, math, and instruction-following. Its strengths align with the following real-world workloads:

Agentic coding and software engineering – SWE-Bench Verified 70.7 indicates strong ability to resolve real GitHub issues autonomously. The model can write, debug, and refactor code across languages, with a dedicated code pre-training dataset of 173B fresh GitHub tokens (cutoff September 2025).
Complex multi-step reasoning – MMLU-Pro 86.8 and LiveCodeBench v6 89.0 reflect robust logical deduction and competitive programming. The reasoning mode can be toggled on/off via chat template (enable_thinking=True/False), allowing you to control inference-time compute budget.
Long-context RAG and document analysis – With 1M token context and high RULER scores, the model can ingest entire technical manuals, legal contracts, or research papers in one pass, then answer questions or extract structured data.
Tool use and function-calling – Designed for agentic workflows, it can call external APIs, execute code, and manage multi-turn interactions with structured outputs.
Multilingual reasoning – Supports English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese, and Arabic. Performance is strongest in English but holds up in the other languages for reasoning and instruction-following tasks.

Running Nemotron 3 Ultra Locally

This is not a model you run on a single consumer GPU. With 550B total parameters and 55B active, even the NVFP4 quantized version requires substantial hardware.

Hardware Requirements

Quantization	Min GPU Setup	Recommended GPU Setup
BF16 (full)	8x GB200/B200 or 16x H100 (80GB)	8x H200 (141GB)
NVFP4 (quantized)	4x GB200/B200 or 8x H100 (80GB)	4x GB300/B300

For practitioners without access to datacenter GPUs, the realistic path is renting multi-GPU instances or using a local cluster. Consumer cards like the RTX 4090 (24GB) cannot hold the model even with aggressive quantization — you would need at least 8× RTX 4090s in a multi-GPU setup to run NVFP4. Apple Silicon users with M4 Max (128GB unified memory) may be able to run the NVFP4 checkpoint using llama.cpp with Metal acceleration, but expect single-digit tokens per second.

Recommended Quantization

The NVFP4 checkpoint is the most practical for local deployment. It requires roughly 140GB of GPU memory (model weights) plus overhead for KV cache and activations. For multi-GPU setups, use tensor parallelism across 4-8 GPUs.

If you need to run on less memory, community GGUF quantizations (Q4_K_M, Q5_K_M) are likely to appear, but as of release only the official NVFP4 and BF16 checkpoints are available. The NVFP4 version offers the best memory-accuracy tradeoff — NVIDIA reports negligible accuracy loss on most benchmarks.

Expected Performance

On an 8× H100 (80GB) setup with NVFP4 and tensor parallelism, expect 40-60 tokens per second for short outputs (1K tokens) and 20-30 tokens per second for long generations (64K tokens). The MTP speculative decoding adds a 1.2-1.5× speedup over standard autoregressive decoding.

Ollama is not yet supported for this model size, but the recommended path is using NVIDIA’s Nemotron inference server or vLLM with the official checkpoints. For a quick local test, use the Hugging Face Transformers integration with device_map="auto" and load_in_4bit=True (via bitsandbytes) on a single node with 4-8 GPUs.

How It Compares

Nemotron 3 Ultra sits in a small class of frontier open-weight MoE models. The most relevant comparisons are:

vs Qwen-3.5-397B-A17B (397B total, 17B active):

Qwen has fewer active parameters (17B vs 55B), so it runs faster on the same hardware but scores lower on reasoning benchmarks (MMLU-Pro ~80 vs 86.8).
Nemotron’s 1M context window far exceeds Qwen’s 128K, making it the clear choice for long-document tasks.
Qwen is more widely supported in consumer quantization (GGUF) and can run on a single 48GB GPU (e.g., RTX 6000 Ada) with Q4, whereas Nemotron requires multi-GPU.

vs GLM-5.1-754B-A40B (754B total, 40B active):

GLM has more total parameters but fewer active per token (40B) and a smaller context (128K). Nemotron achieves 5.9× higher throughput on long outputs.
Nemotron’s SWE-Bench score (70.7) is significantly higher than GLM’s reported ~60, making it better for autonomous coding agents.
GLM is Chinese-centric; Nemotron has stronger multilingual support across Western and Asian languages.

When to choose Nemotron 3 Ultra: You need the highest open-weight accuracy for reasoning, coding, and long-context tasks, and you have the hardware to run it (multi-GPU workstation or cluster). It is the best option for agentic systems that must process hundreds of thousands of tokens without cloud round-trips.

When to choose an alternative: If you are limited to a single consumer GPU (24GB VRAM) or need quick deployment with Ollama, look at smaller models like Qwen-3.5-32B or Llama 4 Scout. For production inference at scale, Nemotron’s throughput advantage makes it compelling even on rented GPU instances.

Related Models

NVIDIA

Explore the Provider

See all NVIDIA models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every NVIDIA model we track.

Open NVIDIA

Explore the Family

See every Nemotron release

The full Nemotron family leaderboard with sizes, benchmark scores, and a release timeline.

Open Nemotron

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

550B

NVIDIA

Nemotron 3 Ultra

550B paramsMoE1000K ctx

View on Hugging Face Official Page

Our Take

Best for: Strongest at graduate-level reasoning (GPQA) in its size class

Run this onASUS ExpertCenter Pro ET900N G3Cheapest card in our directory with comfortable headroom (748 GB) for this model at Q4 (~472.4 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters550B

Active Params55B

ArchitectureMoE

Context Length1M tokens

ModalityText Only

Training Cutoff2025-09

ProviderNVIDIA

Download Size2.2 TB

Community

Monthly Downloads129.3K

Likes246

Last Updated16 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

OpenMDW License Agreement v1.1View Full License

Performance & Scoring

Benchmarks

87.0

86.8

71.9

26.7

MBA Open Score

47.5CC

Benchmark40%

68.1

Popularity25%

31.1

Efficiency20%

4.2

Versatility15%

77.5

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	460.8 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	472.4 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	477.9 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	484.5 GB	Excellent	Near-lossless quality with manageable size
Q8_0	498.2 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	550.5 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ASUS ExpertCenter Pro ET900N G3ASUS	AA	12.1 tok/s	472.4 GB
Dell Pro Max with GB300Dell	AA	12.1 tok/s	472.4 GB
Gigabyte W775-V10-L01Gigabyte	AA	12.1 tok/s	472.4 GB
HP ZGX Fury AI StationHP	AA	12.1 tok/s	472.4 GB
MSI XpertStation WS300MSI	AA	12.1 tok/s	472.4 GB
SuperMicro Super AI StationSuperMicro	AA	12.1 tok/s	472.4 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	CC	1.4 tok/s	472.4 GB
Apple Mac Studio (M3 Ultra, 2025)Apple	CC	1.4 tok/s	472.4 GB
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	FF	0.9 tok/s	472.4 GB
Acer Veriton GN100 AI MiniAcer	FF	0.5 tok/s	472.4 GB
AMD Instinct MI300XAMD	FF	9.0 tok/s	472.4 GB
AMD Instinct MI325XAMD	FF	10.2 tok/s	472.4 GB
AMD Instinct MI355XAMD	FF	13.6 tok/s	472.4 GB
AMD Radeon RX 7600 8GBAMD	FF	0.5 tok/s	472.4 GB
AMD Radeon RX 7700 XTAMD	FF	0.7 tok/s	472.4 GB
AMD Radeon RX 7800 XTAMD	FF	1.1 tok/s	472.4 GB
AMD Radeon RX 7900 XTAMD	FF	1.4 tok/s	472.4 GB
AMD Radeon RX 7900 XTXAMD	FF	1.6 tok/s	472.4 GB
AMD Radeon RX 9070AMD	FF	1.1 tok/s	472.4 GB
AMD Radeon RX 9070 XTAMD	FF	1.1 tok/s	472.4 GB
Apple M4Apple	FF	0.2 tok/s	472.4 GB
Apple M4 Max (40-core GPU)Apple	FF	0.9 tok/s	472.4 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	FF	0.5 tok/s	472.4 GB
Apple M5Apple	FF	0.3 tok/s	472.4 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	FF	1.0 tok/s	472.4 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on Apple M3 Ultra (32-core CPU, 80-core GPU) (~1.4 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Nemotron 3 Ultra on Apple M3 Ultra (32-core CPU, 80-core GPU) · ~1.4 tok/s · 160W	$3.82
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 472 GB VRAM, refreshed hourly.

No current rental listing covers this model’s VRAM requirement on the providers we track.

About This Model

Overview

Architecture & Technical Details

Mamba-2 layers – state-space models that scale linearly with sequence length, reducing memory and compute for long contexts.
Standard attention layers – retained where full quadratic attention is beneficial for tasks requiring precise token interactions.
MoE feed-forward blocks – only 55B of the 550B total parameters are active per token, meaning inference speed is closer to a 55B dense model than a 550B one.

Capabilities & Use Cases

Agentic coding and software engineering – SWE-Bench Verified 70.7 indicates strong ability to resolve real GitHub issues autonomously. The model can write, debug, and refactor code across languages, with a dedicated code pre-training dataset of 173B fresh GitHub tokens (cutoff September 2025).
Complex multi-step reasoning – MMLU-Pro 86.8 and LiveCodeBench v6 89.0 reflect robust logical deduction and competitive programming. The reasoning mode can be toggled on/off via chat template (enable_thinking=True/False), allowing you to control inference-time compute budget.
Long-context RAG and document analysis – With 1M token context and high RULER scores, the model can ingest entire technical manuals, legal contracts, or research papers in one pass, then answer questions or extract structured data.
Tool use and function-calling – Designed for agentic workflows, it can call external APIs, execute code, and manage multi-turn interactions with structured outputs.
Multilingual reasoning – Supports English, French, Spanish, Italian, German, Japanese, Korean, Hindi, Brazilian Portuguese, Chinese, and Arabic. Performance is strongest in English but holds up in the other languages for reasoning and instruction-following tasks.

Running Nemotron 3 Ultra Locally

This is not a model you run on a single consumer GPU. With 550B total parameters and 55B active, even the NVFP4 quantized version requires substantial hardware.

Hardware Requirements

Quantization	Min GPU Setup	Recommended GPU Setup
BF16 (full)	8x GB200/B200 or 16x H100 (80GB)	8x H200 (141GB)
NVFP4 (quantized)	4x GB200/B200 or 8x H100 (80GB)	4x GB300/B300

Recommended Quantization

Expected Performance

How It Compares

Nemotron 3 Ultra sits in a small class of frontier open-weight MoE models. The most relevant comparisons are:

vs Qwen-3.5-397B-A17B (397B total, 17B active):

Qwen has fewer active parameters (17B vs 55B), so it runs faster on the same hardware but scores lower on reasoning benchmarks (MMLU-Pro ~80 vs 86.8).
Nemotron’s 1M context window far exceeds Qwen’s 128K, making it the clear choice for long-document tasks.
Qwen is more widely supported in consumer quantization (GGUF) and can run on a single 48GB GPU (e.g., RTX 6000 Ada) with Q4, whereas Nemotron requires multi-GPU.

vs GLM-5.1-754B-A40B (754B total, 40B active):

GLM has more total parameters but fewer active per token (40B) and a smaller context (128K). Nemotron achieves 5.9× higher throughput on long outputs.
Nemotron’s SWE-Bench score (70.7) is significantly higher than GLM’s reported ~60, making it better for autonomous coding agents.
GLM is Chinese-centric; Nemotron has stronger multilingual support across Western and Asian languages.