NVIDIA's largest open-weight Nemotron model, a Latent Mixture-of-Experts design with 550B total parameters and 55B active per token. It uses a hybrid architecture that interleaves Mamba-2, MoE, and select attention layers, adds Multi-Token Prediction for faster generation, and supports a 1M-token context window. It is text-only, was pre-trained on data through September 2025 with post-training through May 2026, and ships under the OpenMDW License. On benchmarks it scores 86.8 on MMLU-Pro, 70.7 on SWE-Bench Verified, 89.0 on LiveCodeBench v6, and 94.7 on RULER at 1M context, with an Artificial Analysis Intelligence Index of 47.7.
A workable 550B-parameter MoE language model from NVIDIA. It pulls ahead on graduate-level reasoning (GPQA, 87/100), so reach for it when that is the dimension that matters. It is newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality |
|---|---|---|
| Q2_K | 460.8 GB | Low |
| Q4_K_M (Recommended) | 472.4 GB | Good |
| Q5_K_M | 477.9 GB | Very Good |
| Q6_K | 484.5 GB | Excellent |
| Q8_0 | 498.2 GB | Near Perfect |
| FP16 | 550.5 GB | Full |
See which devices can run this model and at what quality level.
| Device | Rating | Speed | VRAM Required |
|---|---|---|---|
| | AA | 12.1 tok/s | 472.4 GB |
| | AA | 12.1 tok/s | 472.4 GB |
| Gigabyte W775-V10-L01 (Gigabyte) | AA | 12.1 tok/s | 472.4 GB |
| | AA | 12.1 tok/s | 472.4 GB |
| | AA | 12.1 tok/s | 472.4 GB |
| SuperMicro Super AI Station (SuperMicro) | AA | 12.1 tok/s | 472.4 GB |
| | CC | 1.4 tok/s | 472.4 GB |
| | CC | 1.4 tok/s | 472.4 GB |
| | FF | 0.9 tok/s | 472.4 GB |
| | FF | 0.5 tok/s | 472.4 GB |
| | FF | 9.0 tok/s | 472.4 GB |
| | FF | 10.2 tok/s | 472.4 GB |
| | FF | 13.6 tok/s | 472.4 GB |
| | FF | 0.5 tok/s | 472.4 GB |
| | FF | 0.7 tok/s | 472.4 GB |
| | FF | 1.1 tok/s | 472.4 GB |
| | FF | 1.4 tok/s | 472.4 GB |
| | FF | 1.6 tok/s | 472.4 GB |
| | FF | 1.1 tok/s | 472.4 GB |
| | FF | 1.1 tok/s | 472.4 GB |
| Apple M4 (Apple) | FF | 0.2 tok/s | 472.4 GB |
| | FF | 0.9 tok/s | 472.4 GB |
| | FF | 0.5 tok/s | 472.4 GB |
| Apple M5 (Apple) | FF | 0.3 tok/s | 472.4 GB |
| | FF | 1.0 tok/s | 472.4 GB |
Energy cost on Apple M3 Ultra (32-core CPU, 80-core GPU) (~1.4 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
| Local (energy only): Nemotron 3 Ultra on Apple M3 Ultra (32-core CPU, 80-core GPU) · ~1.4 tok/s · 160W | $3.82 |
| GPT-5.5 (OpenAI) · in $5.00 · out $30.00 | $12.50 |
| Claude Opus 4.7 Thinking (Anthropic) · in $5.00 · out $25.00 | $11.00 |
| Gemini 3.5 Flash (Google) · in $1.50 · out $9.00 | $3.75 |
| Grok 4.3 (xAI) · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
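The cost figures in this comparison can be approximated with a few lines of arithmetic. The sketch below blends input and output prices at the stated 70/30 split and converts power draw and throughput into an energy cost per million tokens. The electricity rate of $0.12/kWh is an assumption (the page does not state the rate it uses), and the function names are illustrative.

```python
def blended_price(input_price: float, output_price: float,
                  input_share: float = 0.70) -> float:
    """Blend per-1M-token input and output prices by token share."""
    return input_share * input_price + (1.0 - input_share) * output_price


def local_energy_cost(watts: float, tokens_per_second: float,
                      usd_per_kwh: float, tokens: float = 1_000_000) -> float:
    """Energy-only cost (USD) of generating `tokens` tokens locally."""
    seconds = tokens / tokens_per_second          # wall-clock generation time
    kwh = (watts / 1000.0) * (seconds / 3600.0)   # energy drawn over that time
    return kwh * usd_per_kwh


# API rows from the table: (input price, output price) per 1M tokens.
for name, (inp, out) in {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.7 Thinking": (5.00, 25.00),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Grok 4.3": (1.25, 2.50),
}.items():
    print(f"{name}: ${blended_price(inp, out):.3f}")

# Local row: 160 W at ~1.4 tok/s; the $0.12/kWh rate is an assumed value.
print(f"Local: ${local_energy_cost(160, 1.4, 0.12):.2f}")
```

At the assumed rate the local estimate lands within a cent or so of the $3.82 shown, which suggests the page uses a rate close to $0.12/kWh. At 1.4 tok/s, one million tokens also take roughly eight days of continuous generation, which the energy figure alone does not convey.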
Cheapest current cloud rentals with at least 472 GB VRAM, refreshed hourly.
No current rental listing covers this model’s VRAM requirement on the providers we track.
Nemotron 3 Ultra is NVIDIA’s largest open-weight language model to date — a 550B-parameter Mixture-of-Experts (MoE) design that activates only 55B parameters per token. It is the final and most capable model in the Nemotron 3 family, built for practitioners who need frontier-level reasoning, long-context analysis, and agentic workflows without relying on cloud APIs.
What sets Nemotron 3 Ultra apart is its hybrid architecture: it interleaves Mamba-2 state-space layers, MoE feed-forward blocks, and selective attention layers. This combination reduces the KV cache footprint and attention cost, yielding up to 5.9× higher inference throughput than comparable open models like GLM-5.1-754B-A40B (8K input / 64K output setting). The model was pre-trained on 20 trillion tokens (cutoff September 2025) and post-trained through May 2026 using SFT, reinforcement learning, and Multi-teacher On-Policy Distillation (MOPD).
Nemotron 3 Ultra competes directly with other large MoE models such as Qwen-3.5-397B-A17B and Kimi-K2.6-1T-A32B. It scores 86.8 on MMLU-Pro, 70.7 on SWE-Bench Verified, 89.0 on LiveCodeBench v6, and 94.7 on RULER at 1M context — numbers that put it at the top of the open-weight leaderboard for both reasoning and long-context tasks. Its Artificial Analysis Intelligence Index of 47.7 reflects strong overall capability across diverse benchmarks.
Nemotron 3 Ultra uses a LatentMoE design, a variant of Mixture-of-Experts where the router learns a compressed latent representation before selecting experts. This improves accuracy per active parameter compared to standard MoE. The hybrid backbone mixes Mamba-2 state-space layers, MoE feed-forward blocks, and selective attention layers.
The model also includes Multi-Token Prediction (MTP) layers, which act as native speculative decoding by predicting multiple future tokens in parallel. This directly boosts tokens-per-second during autoregressive generation without requiring a separate draft model.
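As a rough illustration of why predicting several tokens per step helps, the sketch below uses the standard speculative-decoding estimate of expected tokens emitted per verification step, assuming each drafted token is accepted independently with the same probability. The acceptance rate and draft length are hypothetical inputs, not published figures for Nemotron 3 Ultra, and the estimate ignores the extra compute spent in the MTP layers.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per verification step.

    Toy model: each of `draft_len` drafted tokens is accepted independently
    with probability `accept_rate`; decoding stops at the first rejection,
    and the verifying pass always contributes one token of its own.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_rate == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


# Hypothetical settings, for illustration only.
for rate in (0.3, 0.5, 0.8):
    print(rate, round(expected_tokens_per_step(rate, draft_len=2), 3))
```

With no accepted drafts the function returns 1.0, which is ordinary one-token-at-a-time decoding. Real speedups are lower than the raw token count suggests because drafting and verification are not free.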
Context length is 1,000,000 tokens — achieved through long-context extension after pre-training. At 1M tokens, Nemotron 3 Ultra outperforms all other open LLMs on the RULER benchmark, making it viable for full-document analysis, massive codebases, or long-running agentic memory.
Precision: The model was pre-trained in NVFP4 (NVIDIA’s 4-bit floating point), and the released checkpoints include BF16 and NVFP4 quantized versions. NVFP4 quantized weights reduce memory footprint by roughly 4× versus BF16 while maintaining near-lossless accuracy on most benchmarks.
Nemotron 3 Ultra is a text-only model with a broad skill set: chat, code generation, reasoning, function-calling, multilingual support, math, and instruction-following. Its strengths align with the following real-world workloads:
Reasoning can be toggled (enable_thinking=True/False), allowing you to control the inference-time compute budget.
This is not a model you run on a single consumer GPU. With 550B total parameters and 55B active, even the NVFP4 quantized version requires substantial hardware.
| Quantization | Min GPU Setup | Recommended GPU Setup |
|---|---|---|
| BF16 (full) | 8x GB200/B200 or 16x H100 (80GB) | 8x H200 (141GB) |
| NVFP4 (quantized) | 4x GB200/B200 or 8x H100 (80GB) | 4x GB300/B300 |
For practitioners without access to datacenter GPUs, the realistic path is renting multi-GPU instances or using a local cluster. Consumer cards like the RTX 4090 (24GB) cannot hold the model even with aggressive quantization: 550B parameters at 4 bits per weight come to about 275GB before KV cache and activations, so even eight RTX 4090s (192GB combined) fall short. For the same reason, an Apple Silicon M4 Max (128GB unified memory) cannot hold a 4-bit checkpoint. Apple Silicon machines with more unified memory may be able to run one using llama.cpp with Metal acceleration, but expect low single-digit tokens per second; the estimate on this page for an M3 Ultra at Q4_K_M is about 1.4 tok/s.
The NVFP4 checkpoint is the most practical for local deployment. At 4 bits per weight, the 550B parameters occupy roughly 275GB of GPU memory for the weights alone, plus overhead for KV cache and activations. For multi-GPU setups, use tensor parallelism across 4-8 GPUs.
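A back-of-the-envelope check for whether a GPU setup can hold the weights is sketched below. It counts only the bits per weight times the parameter count; the 80% usable-memory headroom left for KV cache, activations, and runtime overhead is an assumed rule of thumb, not an NVIDIA figure, and the function names are illustrative.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Decimal gigabytes needed to store the weights alone."""
    return total_params * bits_per_weight / 8 / 1e9


def fits(total_params: float, bits_per_weight: float,
         gpu_count: int, gpu_memory_gb: float,
         headroom: float = 0.8) -> bool:
    """True if the weights fit in `headroom` of the combined GPU memory.

    `headroom` is an assumed fraction reserved for weights; the rest is
    left for KV cache, activations, and framework overhead.
    """
    usable_gb = gpu_count * gpu_memory_gb * headroom
    return weight_memory_gb(total_params, bits_per_weight) <= usable_gb


PARAMS = 550e9  # total parameters, from the model description

print(weight_memory_gb(PARAMS, 4))    # 4-bit weights
print(weight_memory_gb(PARAMS, 16))   # BF16 weights
print(fits(PARAMS, 4, gpu_count=8, gpu_memory_gb=80))   # 8x H100 (80GB)
print(fits(PARAMS, 4, gpu_count=8, gpu_memory_gb=24))   # 8x RTX 4090 (24GB)
```

Because all 550B parameters must be resident even though only 55B are active per token, the total count, not the active count, determines the memory requirement.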
If you need to run on less memory, community GGUF quantizations (Q4_K_M, Q5_K_M) are likely to appear, but as of release only the official NVFP4 and BF16 checkpoints are available. The NVFP4 version offers the best memory-accuracy tradeoff — NVIDIA reports negligible accuracy loss on most benchmarks.
On an 8× H100 (80GB) setup with NVFP4 and tensor parallelism, expect 40-60 tokens per second for short outputs (1K tokens) and 20-30 tokens per second for long generations (64K tokens). The MTP speculative decoding adds a 1.2-1.5× speedup over standard autoregressive decoding.
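To put these throughput figures in wall-clock terms, the sketch below converts a decode rate into generation time. It treats 64K as 64,000 tokens and assumes a constant rate over the whole output, both simplifications; prompt processing time is ignored.

```python
def generation_minutes(output_tokens: int, tokens_per_second: float) -> float:
    """Minutes needed to decode `output_tokens` at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second / 60.0


# Ranges quoted in the text for an 8x H100 (80GB) NVFP4 setup.
scenarios = {
    "1K output at 40-60 tok/s": (1_000, 40.0, 60.0),
    "64K output at 20-30 tok/s": (64_000, 20.0, 30.0),
}

for label, (tokens, slow, fast) in scenarios.items():
    best = generation_minutes(tokens, fast)
    worst = generation_minutes(tokens, slow)
    print(f"{label}: {best:.1f} to {worst:.1f} minutes")
```

Under these assumptions a single long generation runs for tens of minutes, which matters when sizing agentic workloads that chain many such calls.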
Ollama does not yet support this model size; the recommended path is NVIDIA’s Nemotron inference server or vLLM with the official checkpoints. For a quick local test, use the Hugging Face Transformers integration with device_map="auto" and load_in_4bit=True (via bitsandbytes) on a single node with 4-8 GPUs.
Nemotron 3 Ultra sits in a small class of frontier open-weight MoE models. The most relevant comparisons are:
vs Qwen-3.5-397B-A17B (397B total, 17B active):
vs GLM-5.1-754B-A40B (754B total, 40B active):
When to choose Nemotron 3 Ultra: You need the highest open-weight accuracy for reasoning, coding, and long-context tasks, and you have the hardware to run it (multi-GPU workstation or cluster). It is the best option for agentic systems that must process hundreds of thousands of tokens without cloud round-trips.
When to choose an alternative: If you are limited to a single consumer GPU (24GB VRAM) or need quick deployment with Ollama, look at smaller models like Qwen-3.5-32B or Llama 4 Scout. For production inference at scale, Nemotron’s throughput advantage makes it compelling even on rented GPU instances.
