NVIDIA's 30B parameter hybrid MoE model (activating 3B parameters) unifying text, image, video, and audio understanding. Designed as a low-latency perception and context sub-agent.
A strong 30B-parameter MoE language model from NVIDIA. High composite score across our benchmark mix — worth shortlisting when raw quality matters more than VRAM budget. Newly released, so production-readiness is still being shaken out.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
No benchmark data available for this model yet.
See how different quantization levels affect VRAM requirements and quality for this model.
| Format | VRAM Required | Quality | |
|---|---|---|---|
| Q2_K | 7.9 GB | Low | |
| Q4_K_MRecommended | 8.5 GB | Good | |
| Q5_K_M | 8.8 GB | Very Good | |
| Q6_K | 9.2 GB | Excellent | |
| Q8_0 | 9.9 GB | Near Perfect | |
| FP16 | 12.8 GB | Full |
See which devices can run this model and at what quality level.
| SS | 48.3 tok/s | 8.5 GB | ||
| SS | 40.8 tok/s | 8.5 GB | ||
| SS | 58.9 tok/s | 8.5 GB | ||
| SS | 60.4 tok/s | 8.5 GB | ||
| SS | 60.4 tok/s | 8.5 GB | ||
Google Cloud TPU v5eGoogle | SS | 77.3 tok/s | 8.5 GB | |
Intel Arc A770 16GBIntel | SS | 52.8 tok/s | 8.5 GB | |
Intel Arc B580Intel | SS | 43.0 tok/s | 8.5 GB | |
| SS | 90.6 tok/s | 8.5 GB | ||
NVIDIA GeForce RTX 4070NVIDIA | SS | 47.6 tok/s | 8.5 GB | |
| SS | 47.6 tok/s | 8.5 GB | ||
| SS | 63.4 tok/s | 8.5 GB | ||
| SS | 69.4 tok/s | 8.5 GB | ||
| SS | 42.3 tok/s | 8.5 GB | ||
NVIDIA GeForce RTX 5070NVIDIA | SS | 63.4 tok/s | 8.5 GB | |
| SS | 84.5 tok/s | 8.5 GB | ||
| SS | 90.6 tok/s | 8.5 GB | ||
| SS | 75.5 tok/s | 8.5 GB | ||
| SS | 90.6 tok/s | 8.5 GB | ||
NVIDIA GeForce RTX 3090NVIDIA | SS | 88.3 tok/s | 8.5 GB | |
| SS | 95.1 tok/s | 8.5 GB | ||
| SS | 154.7 tok/s | 8.5 GB | ||
| SS | 169.1 tok/s | 8.5 GB | ||
Origin PC M-CLASS v2Origin PC | SS | 169.1 tok/s | 8.5 GB | |
NVIDIA L40SNVIDIA | SS | 81.5 tok/s | 8.5 GB |
Energy cost on AMD Radeon RX 7600 8GB (~27 tok/s, Q4_K_M) vs flagship API pricing.
| Source | Cost per 1M tokens |
|---|---|
Local (energy only)Nemotron 3 Nano Omni on AMD Radeon RX 7600 8GB · ~27 tok/s · 165W | $0.202 |
GPT-5.5OpenAI · in $5.00 · out $30.00 | $12.50 |
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00 | $11.00 |
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00 | $3.75 |
Grok 4.3xAI · in $1.25 · out $2.50 | $1.63 |
API prices blended at 70% input / 30% output.
Hardware amortisation not included. Run the full ROI calculator for payback math.
Cheapest current cloud rentals with at least 9 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 3080Vast.ai · Spot · 10 GB VRAM | $0.04 |
NVIDIA GeForce RTX 3080Vast.ai · On-Demand · 10 GB VRAM | $0.05 |
NVIDIA’s Nemotron 3 Nano Omni is a 30-billion-parameter Mixture-of-Experts multimodal model that activates only 3B parameters per token. It is the first model in the Nemotron Nano family to natively handle audio alongside text, images, and video. Designed as a low-latency perception and context sub-agent, it targets enterprise workloads where speed, long-context understanding, and multimodal input are critical.
This model competes directly with other efficient MoE multimodal models such as Qwen3-VL-30B-A3B and surpasses its predecessor, Nemotron Nano V2 VL, on document understanding, audio-video comprehension, and agentic computer use. Released under the NVIDIA Open Model License, it is fully open for commercial use and available in BF16, FP8, and NVFP4 precisions.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every NVIDIA model we track.

Explore the Family
The full Nemotron family leaderboard with sizes, benchmark scores, and a release timeline.
Nemotron 3 Nano Omni is built on a Mamba2-Transformer hybrid MoE backbone (the Nemotron 3 Nano 30B-A3B language model) augmented with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder. The MoE design means the model has 30B total parameters across multiple experts, but only ~3B are active for each token. This directly translates to lower VRAM consumption and faster inference than a dense 30B model, while retaining the knowledge capacity of a much larger network.
For local deployment, the combination of MoE efficiency and low-bit quantization (especially NVFP4 or common 4‑bit formats like Q4_K_M) makes this model accessible on consumer hardware that would struggle with a dense 30B model.
Nemotron 3 Nano Omni is a general-purpose multimodal model, but its strength lies in real-world, enterprise-facing tasks that combine vision, language, and audio.
When to choose this model: If your workload demands full multimodal fusion (video+speech+text) and you need to run efficiently on a single GPU, Nemotron 3 Nano Omni is a strong fit. For pure text-only tasks, a dedicated language MoE might be more efficient; for vision-only tasks, a specialized vision model may yield higher accuracy.
The model’s efficiency makes it one of the few 30B-class multimodal models that can run on a single consumer GPU with appropriate quantization.
| Precision | Minimum GPU | RAM / VRAM |
|---|---|---|
| BF16 | 1× H100 80GB | 62 GB |
| FP8 | 1× L40S 48GB or RTX Pro 6000 | 33 GB |
| NVFP4 | 1× RTX 5090 32GB or DGX Spark | 21 GB |
| Q4_K_M | 1× RTX 4090 24GB / M4 Max (64GB) | ~15–18 GB |
Realistic consumer setup:
q4_k_m via llama.cpp or Ollama). The model weights drop to ~15–16 GB, leaving room for KV cache. Expect 30–50 tokens per second for text generation, slightly lower for multimodal inputs due to encoder overhead. llama.cpp or mlx. The 64 GB unified memory handles the model comfortably, though GPU bandwidth is lower than NVIDIA’s.For most users, Q4_K_M offers the best balance of quality and performance. The official NVFP4 format is efficient but currently requires NVIDIA’s custom runtime or NeMo framework support. If you’re using Ollama, the community will likely provide pre-quantized q4_k_m and q4_k_m variants shortly after release.
The quickest way to run it locally is through Ollama (once the model is added to the library) or via the official HuggingFace checkpoints using transformers and vllm. For agentic workflows, consider using LLaMA.cpp with server mode for function-calling and multimodal inputs.
1# Example with Ollama (once available)2ollama run nvidia/nemotron-3-nano-omni34# Or with llama.cpp5./llama-cli -m Nemotron-3-Nano-Omni-30B-A3B-Q4_K_M.gguf --mmproj mmproj-model.gguf
Nemotron 3 Nano Omni vs Qwen3-VL-30B-A3B
Both are MoE models with 30B total and ~3B active parameters. Qwen3-VL-30B-A3B is also multimodal (text+images) and performs similarly on benchmarks. The key differentiators: Nemotron 3 Nano Omni adds native audio input and is optimized for long-context (256K vs 32K) and agentic computer use. Qwen3-VL may have stronger Chinese language performance and a larger ecosystem of fine-tuned variants. Choose Nemotron if your pipeline requires audio+video fusion or if you need the NVIDIA ecosystem (NIM, TensorRT-LLM).
Nemotron 3 Nano Omni vs a dense 30B model (e.g., Yi-34B)
Dense 30B models typically require 60–80 GB of VRAM at FP16 and deliver lower tokens per second on consumer hardware. Nemotron’s MoE design gives you the knowledge of a large model with only 3B active parameters, drastically cutting VRAM and latency. The trade-off: MoE models can be more sensitive to batch size and may have slightly higher perplexity on some narrow tasks. For most practitioners running locally on a single GPU, the efficiency advantage heavily favors MoE.
Best GPU for Nemotron 3 Nano Omni: If budget allows, an L40S 48GB (FP8) or RTX 5090 32GB (NVFP4) provides out-of-the-box support. For existing RTX 4090 owners, Q4_K_M quantization is a practical path with solid performance.
| $0.05 |
NVIDIA GeForce RTX 3090Vast.ai · On-Demand · 24 GB VRAM | $0.08 |
NVIDIA GeForce RTX 5070Vast.ai · Spot · 12 GB VRAM | $0.08 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.