NVIDIA

Nemotron 3 Nano Omni

NVIDIA's 30B parameter hybrid MoE model (activating 3B parameters) unifying text, image, video, and audio understanding. Designed as a low-latency perception and context sub-agent.

30B paramsMoE262K ctxMultimodal

View on Hugging Face Official Page

Our Take

Best for: Strongest at IFBench in its size class

A workable 30B-parameter MoE language model from NVIDIA. Pulls ahead on IFBench (63/100), so reach for it when that's the dimension that matters.

Run this onAMD Radeon RX 7700 XTCheapest card in our directory with comfortable headroom (12 GB) for this model at Q4 (~8.5 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters30B

Active Params3B

ArchitectureMoE

Context Length262K tokens

ModalityMultimodal

ProviderNVIDIA

Download Size66.1 GB

Community

Monthly Downloads1.0M

Likes370

Last Updated1 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

NVIDIA Open Model License AgreementView Full License

Performance & Scoring

Benchmarks

GPQA

46.9

HLE

5.3

AA Intelligence Index

14.9

27.8

8.3

63.2

35.7

MBA Open Score

54.7CC

Benchmark40%

28.9

Popularity25%

49.7

Efficiency20%

80.3

Versatility15%

97.5

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	7.9 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	8.5 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	8.8 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	9.2 GB	Excellent	Near-lossless quality with manageable size
Q8_0	9.9 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	12.8 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	48.3 tok/s	8.5 GB
AMD Radeon RX 7700 XTAMD	SS	40.8 tok/s	8.5 GB
AMD Radeon RX 7800 XTAMD	SS	58.9 tok/s	8.5 GB
AMD Radeon RX 9070AMD	SS	60.4 tok/s	8.5 GB
AMD Radeon RX 9070 XTAMD	SS	60.4 tok/s	8.5 GB
Google Cloud TPU v5eGoogle	SS	77.3 tok/s	8.5 GB
Intel Arc A770 16GBIntel	SS	52.8 tok/s	8.5 GB
Intel Arc B580Intel	SS	43.0 tok/s	8.5 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	SS	90.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070NVIDIA	SS	47.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	SS	47.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	SS	63.4 tok/s	8.5 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	SS	69.4 tok/s	8.5 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	SS	42.3 tok/s	8.5 GB
NVIDIA GeForce RTX 5070NVIDIA	SS	63.4 tok/s	8.5 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	SS	84.5 tok/s	8.5 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	SS	90.6 tok/s	8.5 GB
AMD Radeon RX 7900 XTAMD	SS	75.5 tok/s	8.5 GB
AMD Radeon RX 7900 XTXAMD	SS	90.6 tok/s	8.5 GB
NVIDIA GeForce RTX 3090NVIDIA	SS	88.3 tok/s	8.5 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	SS	95.1 tok/s	8.5 GB
Google Cloud TPU v6e (Trillium)Google	SS	154.7 tok/s	8.5 GB
NVIDIA GeForce RTX 5090 Founders EditionNVIDIA	SS	169.1 tok/s	8.5 GB
Origin PC M-CLASS v2Origin PC	SS	169.1 tok/s	8.5 GB
NVIDIA L40SNVIDIA	SS	81.5 tok/s	8.5 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~27 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Nemotron 3 Nano Omni on AMD Radeon RX 7600 8GB · ~27 tok/s · 165W	$0.202
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 9 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM	$0.07
NVIDIA GeForce RTX 5070Vast.ai · Spot · 12 GB VRAM	$0.07
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM	$0.08
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM	$0.10
NVIDIA GeForce RTX 5070Vast.ai · On-Demand · 12 GB VRAM	$0.10

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

NVIDIA’s Nemotron 3 Nano Omni is a 30-billion-parameter Mixture-of-Experts multimodal model that activates only 3B parameters per token. It is the first model in the Nemotron Nano family to natively handle audio alongside text, images, and video. Designed as a low-latency perception and context sub-agent, it targets enterprise workloads where speed, long-context understanding, and multimodal input are critical.

This model competes directly with other efficient MoE multimodal models such as Qwen3-VL-30B-A3B and surpasses its predecessor, Nemotron Nano V2 VL, on document understanding, audio-video comprehension, and agentic computer use. Released under the NVIDIA Open Model License, it is fully open for commercial use and available in BF16, FP8, and NVFP4 precisions.

Architecture & Technical Details

Nemotron 3 Nano Omni is built on a Mamba2-Transformer hybrid MoE backbone (the Nemotron 3 Nano 30B-A3B language model) augmented with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder. The MoE design means the model has 30B total parameters across multiple experts, but only ~3B are active for each token. This directly translates to lower VRAM consumption and faster inference than a dense 30B model, while retaining the knowledge capacity of a much larger network.

Context length: 262,144 tokens (256K in practice) – enough to process hour-long meeting recordings, dense documents, or multi-frame video inputs without aggressive chunking.
Token reduction: NVIDIA applied innovative multimodal token-reduction techniques to reduce latency further, making it suitable for real-time agentic workflows.
Precision options and VRAM (approximate):
BF16: 62 GB – requires a single H100 80GB or H200/B200.
FP8: 33 GB – fits on a single L40S 48GB or RTX Pro 6000.
NVFP4: 21 GB – fits on an RTX 5090 32GB, DGX Spark, or Jetson Thor.
License: NVIDIA Open Model License Agreement – permits commercial use, modification, and redistribution with minimal restrictions.

For local deployment, the combination of MoE efficiency and low-bit quantization (especially NVFP4 or common 4‑bit formats like Q4_K_M) makes this model accessible on consumer hardware that would struggle with a dense 30B model.

Capabilities & Use Cases

Nemotron 3 Nano Omni is a general-purpose multimodal model, but its strength lies in real-world, enterprise-facing tasks that combine vision, language, and audio.

Video + audio comprehension: Process meeting recordings, training videos, or surveillance footage end-to-end. The model can transcribe speech, describe visual scenes, and answer questions about both simultaneously. Example use: analyzing a customer service call video to verify drop-off location via OCR and speech transcription.
Document intelligence: OCR, chart interpretation, and reasoning over long contracts, SOW/MSA, scientific papers, and financial documents. The 256K context allows ingesting entire multi-page PDFs without summarization.
Agentic computer use: GUI automation, browser agents, incident management. The model can interpret screenshots of a desktop, call functions, and navigate interfaces. NVIDIA specifically highlights browser agents and email agents.
Code and reasoning: Supports function-calling, mathematical reasoning, and instruction following. Suitable for coding assistants that need to interpret diagrams or output files.
Multilingual: While language coverage is not specified, the model inherits multilingual capabilities from the Nemotron 3 Nano backbone and can handle text in multiple languages.
Audio input: Native support for speech transcription and analysis without requiring a separate ASR pipeline.

When to choose this model: If your workload demands full multimodal fusion (video+speech+text) and you need to run efficiently on a single GPU, Nemotron 3 Nano Omni is a strong fit. For pure text-only tasks, a dedicated language MoE might be more efficient; for vision-only tasks, a specialized vision model may yield higher accuracy.

Running Nemotron 3 Nano Omni Locally

The model’s efficiency makes it one of the few 30B-class multimodal models that can run on a single consumer GPU with appropriate quantization.

Hardware requirements by precision

Precision	Minimum GPU	RAM / VRAM
BF16	1× H100 80GB	62 GB
FP8	1× L40S 48GB or RTX Pro 6000	33 GB
NVFP4	1× RTX 5090 32GB or DGX Spark	21 GB
Q4_K_M	1× RTX 4090 24GB / M4 Max (64GB)	~15–18 GB

Realistic consumer setup:

RTX 4090 (24 GB): Use a 4-bit quantization (e.g., q4_k_m via llama.cpp or Ollama). The model weights drop to ~15–16 GB, leaving room for KV cache. Expect 30–50 tokens per second for text generation, slightly lower for multimodal inputs due to encoder overhead.
RTX 5090 (32 GB): Use the official NVFP4 checkpoint directly. Tokens per second may reach 40–60+ depending on context length and prompt complexity.
Apple M4 Max (64 GB): Use Q4_K_M via llama.cpp or mlx. The 64 GB unified memory handles the model comfortably, though GPU bandwidth is lower than NVIDIA’s.

Recommended quantization

For most users, Q4_K_M offers the best balance of quality and performance. The official NVFP4 format is efficient but currently requires NVIDIA’s custom runtime or NeMo framework support. If you’re using Ollama, the community will likely provide pre-quantized q4_k_m and q4_k_m variants shortly after release.

Getting started

The quickest way to run it locally is through Ollama (once the model is added to the library) or via the official HuggingFace checkpoints using transformers and vllm. For agentic workflows, consider using LLaMA.cpp with server mode for function-calling and multimodal inputs.

1# Example with Ollama (once available)
2ollama run nvidia/nemotron-3-nano-omni
3
4# Or with llama.cpp
5./llama-cli -m Nemotron-3-Nano-Omni-30B-A3B-Q4_K_M.gguf --mmproj mmproj-model.gguf

How It Compares

Nemotron 3 Nano Omni vs Qwen3-VL-30B-A3B

Both are MoE models with 30B total and ~3B active parameters. Qwen3-VL-30B-A3B is also multimodal (text+images) and performs similarly on benchmarks. The key differentiators: Nemotron 3 Nano Omni adds native audio input and is optimized for long-context (256K vs 32K) and agentic computer use. Qwen3-VL may have stronger Chinese language performance and a larger ecosystem of fine-tuned variants. Choose Nemotron if your pipeline requires audio+video fusion or if you need the NVIDIA ecosystem (NIM, TensorRT-LLM).

Nemotron 3 Nano Omni vs a dense 30B model (e.g., Yi-34B)

Dense 30B models typically require 60–80 GB of VRAM at FP16 and deliver lower tokens per second on consumer hardware. Nemotron’s MoE design gives you the knowledge of a large model with only 3B active parameters, drastically cutting VRAM and latency. The trade-off: MoE models can be more sensitive to batch size and may have slightly higher perplexity on some narrow tasks. For most practitioners running locally on a single GPU, the efficiency advantage heavily favors MoE.

Best GPU for Nemotron 3 Nano Omni: If budget allows, an L40S 48GB (FP8) or RTX 5090 32GB (NVFP4) provides out-of-the-box support. For existing RTX 4090 owners, Q4_K_M quantization is a practical path with solid performance.

Related Models

NVIDIA

Explore the Provider

See all NVIDIA models

Aggregate stats, leaderboard, release timeline, and benchmark coverage across every NVIDIA model we track.

Open NVIDIA

Explore the Family

See every Nemotron release

The full Nemotron family leaderboard with sizes, benchmark scores, and a release timeline.

Open Nemotron

Free Monthly Report

The AI Build Report

The state of AI models, API prices, and what to run where. New every month, free.

NVIDIA

Nemotron 3 Nano Omni

NVIDIA's 30B parameter hybrid MoE model (activating 3B parameters) unifying text, image, video, and audio understanding. Designed as a low-latency perception and context sub-agent.

30B paramsMoE262K ctxMultimodal

View on Hugging Face Official Page

Our Take

Best for: Strongest at IFBench in its size class

A workable 30B-parameter MoE language model from NVIDIA. Pulls ahead on IFBench (63/100), so reach for it when that's the dimension that matters.

Run this onAMD Radeon RX 7700 XTCheapest card in our directory with comfortable headroom (12 GB) for this model at Q4 (~8.5 GB).

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Capabilities

Chat

Code Generation

Vision

Reasoning

Function Calling

Multilingual

Math

Instruction Following

Model Specifications

Parameters30B

Active Params3B

ArchitectureMoE

Context Length262K tokens

ModalityMultimodal

ProviderNVIDIA

Download Size66.1 GB

Community

Monthly Downloads1.0M

Likes370

Last Updated1 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

NVIDIA Open Model License AgreementView Full License

Performance & Scoring

Benchmarks

GPQA

46.9

HLE

5.3

AA Intelligence Index

14.9

27.8

8.3

63.2

35.7

MBA Open Score

54.7CC

Benchmark40%

28.9

Popularity25%

49.7

Efficiency20%

80.3

Versatility15%

97.5

Quantization Options

See how different quantization levels affect VRAM requirements and quality for this model.

Format	VRAM Required	Quality
Q2_K	7.9 GB	Low	Aggressive quantization — smallest size, noticeable quality loss
Q4_K_MRecommended	8.5 GB	Good	Best balance of size and quality for most use-cases
Q5_K_M	8.8 GB	Very Good	Slightly better quality than Q4 with moderate size increase
Q6_K	9.2 GB	Excellent	Near-lossless quality with manageable size
Q8_0	9.9 GB	Near Perfect	Virtually indistinguishable from full precision
FP16	12.8 GB	Full	Full 16-bit floating point — maximum quality, largest size

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)ACEMAGIC	SS	48.3 tok/s	8.5 GB
AMD Radeon RX 7700 XTAMD	SS	40.8 tok/s	8.5 GB
AMD Radeon RX 7800 XTAMD	SS	58.9 tok/s	8.5 GB
AMD Radeon RX 9070AMD	SS	60.4 tok/s	8.5 GB
AMD Radeon RX 9070 XTAMD	SS	60.4 tok/s	8.5 GB
Google Cloud TPU v5eGoogle	SS	77.3 tok/s	8.5 GB
Intel Arc A770 16GBIntel	SS	52.8 tok/s	8.5 GB
Intel Arc B580Intel	SS	43.0 tok/s	8.5 GB
NOVATECH AI Workstation (i9-14900K + RTX 5080)NOVATECH	SS	90.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070NVIDIA	SS	47.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070 SUPERNVIDIA	SS	47.6 tok/s	8.5 GB
NVIDIA GeForce RTX 4070 Ti SUPERNVIDIA	SS	63.4 tok/s	8.5 GB
NVIDIA GeForce RTX 4080 SUPERNVIDIA	SS	69.4 tok/s	8.5 GB
NVIDIA GeForce RTX 5060 Ti 16GBNVIDIA	SS	42.3 tok/s	8.5 GB
NVIDIA GeForce RTX 5070NVIDIA	SS	63.4 tok/s	8.5 GB
NVIDIA GeForce RTX 5070 TiNVIDIA	SS	84.5 tok/s	8.5 GB
NVIDIA GeForce RTX 5080 Founders EditionNVIDIA	SS	90.6 tok/s	8.5 GB
AMD Radeon RX 7900 XTAMD	SS	75.5 tok/s	8.5 GB
AMD Radeon RX 7900 XTXAMD	SS	90.6 tok/s	8.5 GB
NVIDIA GeForce RTX 3090NVIDIA	SS	88.3 tok/s	8.5 GB
NVIDIA GeForce RTX 4090 Founders EditionNVIDIA	SS	95.1 tok/s	8.5 GB
Google Cloud TPU v6e (Trillium)Google	SS	154.7 tok/s	8.5 GB
NVIDIA GeForce RTX 5090 Founders EditionNVIDIA	SS	169.1 tok/s	8.5 GB
Origin PC M-CLASS v2Origin PC	SS	169.1 tok/s	8.5 GB
NVIDIA L40SNVIDIA	SS	81.5 tok/s	8.5 GB

Rows per page

Page 1 of 5

Run Locally vs API

Energy cost on AMD Radeon RX 7600 8GB (~27 tok/s, Q4_K_M) vs flagship API pricing.

Source	Cost per 1M tokens
Local (energy only)Nemotron 3 Nano Omni on AMD Radeon RX 7600 8GB · ~27 tok/s · 165W	$0.202
GPT-5.5OpenAI · in $5.00 · out $30.00	$12.50
Claude Opus 4.7 ThinkingAnthropic · in $5.00 · out $25.00	$11.00
Gemini 3.5 FlashGoogle · in $1.50 · out $9.00	$3.75
Grok 4.3xAI · in $1.25 · out $2.50	$1.63

API prices blended at 70% input / 30% output.

Hardware amortisation not included. Run the full ROI calculator for payback math.

Run the full ROI calculator

Rent in the Cloud

Cheapest current cloud rentals with at least 9 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 5060 TiVast.ai · Spot · 16 GB VRAM	$0.07
NVIDIA GeForce RTX 5070Vast.ai · Spot · 12 GB VRAM	$0.07
NVIDIA GeForce RTX 5060 TiVast.ai · On-Demand · 16 GB VRAM	$0.08
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM	$0.10
NVIDIA GeForce RTX 5070Vast.ai · On-Demand · 12 GB VRAM	$0.10

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

See the full price index

About This Model

Architecture & Technical Details

Context length: 262,144 tokens (256K in practice) – enough to process hour-long meeting recordings, dense documents, or multi-frame video inputs without aggressive chunking.
Token reduction: NVIDIA applied innovative multimodal token-reduction techniques to reduce latency further, making it suitable for real-time agentic workflows.
Precision options and VRAM (approximate):
BF16: 62 GB – requires a single H100 80GB or H200/B200.
FP8: 33 GB – fits on a single L40S 48GB or RTX Pro 6000.
NVFP4: 21 GB – fits on an RTX 5090 32GB, DGX Spark, or Jetson Thor.
License: NVIDIA Open Model License Agreement – permits commercial use, modification, and redistribution with minimal restrictions.

Capabilities & Use Cases

Nemotron 3 Nano Omni is a general-purpose multimodal model, but its strength lies in real-world, enterprise-facing tasks that combine vision, language, and audio.

Video + audio comprehension: Process meeting recordings, training videos, or surveillance footage end-to-end. The model can transcribe speech, describe visual scenes, and answer questions about both simultaneously. Example use: analyzing a customer service call video to verify drop-off location via OCR and speech transcription.
Document intelligence: OCR, chart interpretation, and reasoning over long contracts, SOW/MSA, scientific papers, and financial documents. The 256K context allows ingesting entire multi-page PDFs without summarization.
Agentic computer use: GUI automation, browser agents, incident management. The model can interpret screenshots of a desktop, call functions, and navigate interfaces. NVIDIA specifically highlights browser agents and email agents.
Code and reasoning: Supports function-calling, mathematical reasoning, and instruction following. Suitable for coding assistants that need to interpret diagrams or output files.
Multilingual: While language coverage is not specified, the model inherits multilingual capabilities from the Nemotron 3 Nano backbone and can handle text in multiple languages.
Audio input: Native support for speech transcription and analysis without requiring a separate ASR pipeline.

Running Nemotron 3 Nano Omni Locally

The model’s efficiency makes it one of the few 30B-class multimodal models that can run on a single consumer GPU with appropriate quantization.

Hardware requirements by precision

Precision	Minimum GPU	RAM / VRAM
BF16	1× H100 80GB	62 GB
FP8	1× L40S 48GB or RTX Pro 6000	33 GB
NVFP4	1× RTX 5090 32GB or DGX Spark	21 GB
Q4_K_M	1× RTX 4090 24GB / M4 Max (64GB)	~15–18 GB

Realistic consumer setup:

RTX 4090 (24 GB): Use a 4-bit quantization (e.g., q4_k_m via llama.cpp or Ollama). The model weights drop to ~15–16 GB, leaving room for KV cache. Expect 30–50 tokens per second for text generation, slightly lower for multimodal inputs due to encoder overhead.
RTX 5090 (32 GB): Use the official NVFP4 checkpoint directly. Tokens per second may reach 40–60+ depending on context length and prompt complexity.
Apple M4 Max (64 GB): Use Q4_K_M via llama.cpp or mlx. The 64 GB unified memory handles the model comfortably, though GPU bandwidth is lower than NVIDIA’s.

Recommended quantization

Getting started

1# Example with Ollama (once available)
2ollama run nvidia/nemotron-3-nano-omni
3
4# Or with llama.cpp
5./llama-cli -m Nemotron-3-Nano-Omni-30B-A3B-Q4_K_M.gguf --mmproj mmproj-model.gguf