NVIDIA Canary 180M Flash

NVIDIA Canary 180M Flash is a compact 182M-parameter multilingual encoder-decoder ASR and translation model supporting 4 languages with >1200 RTFx inference speed, designed for mobile and edge deployment.

0.182B params · Dense

Our Take

Best for: Open-source ASR workloads

A solid 0.182B-parameter dense audio model from NVIDIA. Treat the modality benchmarks on this page as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 0.182B

Architecture: Dense

Provider: NVIDIA

Download Size: 737 MB

Community

Monthly Downloads: 1.1K

Likes: 103

Last Updated: 1 year ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER

6.9%

MBA Open Score

69.8 BB

Component	Weight	Score
Benchmark	40%	86.1
Popularity	25%	32.8
Efficiency	25%	80.4
Versatility	10%	70.0
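The page does not say how the four components combine into the overall score. A minimal sketch, assuming a plain weighted sum of the displayed values (the function name `composite_score` is hypothetical), gives about 69.7, which is close to the displayed 69.8; the small gap would be consistent with the components being rounded for display.

```python
# Component weights and scores as displayed on the page.
# Assumption: the overall score is a plain weighted sum of these values.
components = {
    "Benchmark": (0.40, 86.1),
    "Popularity": (0.25, 32.8),
    "Efficiency": (0.25, 80.4),
    "Versatility": (0.10, 70.0),
}


def composite_score(parts):
    """Return the weighted sum of (weight, score) pairs."""
    return sum(weight * score for weight, score in parts.values())


score = composite_score(components)
print(f"{score:.2f}")
```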

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	0.6 GB
Acer Veriton GN100 AI Mini	Acer	SS	0.6 GB
AMD Instinct MI300X	AMD	SS	0.6 GB
AMD Instinct MI325X	AMD	SS	0.6 GB
AMD Instinct MI355X	AMD	SS	0.6 GB
AMD Radeon RX 7600 8GB	AMD	SS	0.6 GB
AMD Radeon RX 7700 XT	AMD	SS	0.6 GB
AMD Radeon RX 7800 XT	AMD	SS	0.6 GB
AMD Radeon RX 7900 XT	AMD	SS	0.6 GB
AMD Radeon RX 7900 XTX	AMD	SS	0.6 GB
AMD Radeon RX 9070	AMD	SS	0.6 GB
AMD Radeon RX 9070 XT	AMD	SS	0.6 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	0.6 GB
Apple M4	Apple	SS	0.6 GB
Apple M4 Max (40-core GPU)	Apple	SS	0.6 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	0.6 GB
Apple M5	Apple	SS	0.6 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	0.6 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	0.6 GB
Apple Mac Mini (M1, 2020)	Apple	SS	0.6 GB
Apple Mac Mini (M2, 2023)	Apple	SS	0.6 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	0.6 GB
Apple Mac Mini (M4, 2024)	Apple	SS	0.6 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	0.6 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	0.6 GB

Showing the first 25 of 102 devices.

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

GPU	Provider · Tier · VRAM	Cost / GPU-hour
NVIDIA L4	Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4	Vast.ai · On-Demand · 24 GB VRAM	$0.03
NVIDIA GeForce RTX 5070 Ti	Vast.ai · Spot · 16 GB VRAM	$0.11
NVIDIA GeForce RTX 3070	RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070	RunPod · Spot · 8 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.


About This Model

Overview

NVIDIA Canary 180M Flash is a compact, multilingual automatic speech recognition (ASR) and speech-to-text translation model developed by NVIDIA. At only 0.182 billion parameters, it is designed explicitly for mobile and edge deployment where latency, power, and memory budgets are tight. Unlike larger ASR models that require datacenter GPUs or cloud APIs, Canary 180M Flash fits on a smartphone SoC, a Raspberry Pi 5, or the NPU of a laptop – while still delivering production-quality transcription and translation.

The model is a dense encoder-decoder architecture, not a mixture-of-experts. Every parameter is active during inference, which simplifies deployment and guarantees predictable memory usage. NVIDIA reports inference speeds exceeding 1200x real-time factor (RTFx) – meaning the model processes 1200 seconds of audio per second of compute. For a one-minute audio clip, inference completes in roughly 50 milliseconds.
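The arithmetic behind that figure is a single division. A small sketch (the helper name `processing_seconds` is illustrative, not part of any library) converts an RTFx value into wall-clock time for a clip of a given length:

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Wall-clock compute time for a clip, given a real-time factor (RTFx)."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# One minute of audio at the reported 1200x real-time factor.
one_minute = processing_seconds(60.0, 1200.0)
print(f"{one_minute * 1000:.0f} ms")  # 50 ms

# One hour of audio at the same factor.
print(f"{processing_seconds(3600.0, 1200.0):.1f} s")  # 3.0 s
```

These are best-case numbers that ignore audio loading, batching overhead, and decoding of the output text.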

Canary 180M Flash competes directly with other small-footprint ASR models like OpenAI Whisper Small (244M parameters) and Meta’s SeamlessM4T-Medium (1.2B parameters). Its key differentiator is the combination of size, speed, and native support for four languages (English, German, Spanish, French) in both transcription and translation tasks.

Architecture & Technical Details

Canary 180M Flash uses a FastConformer encoder paired with a Transformer decoder. The FastConformer variant is a streamlined version of the Conformer architecture that reduces computational overhead by merging consecutive time steps and using a simplified attention mechanism. This is what enables the extreme real-time factor on low-power hardware.

Parameters: 0.182B dense (all active)
Architecture Type: Encoder-decoder
Model Size (fp16): ~350 MB (weights only)
Quantization: Supports fp16, int8, and int4 precision
Context Window: Not specified; typical ASR models use a sliding window of ~30 seconds of audio

Because it is a dense model, there is no sparse activation or expert routing to manage. VRAM usage scales linearly with precision and batch size. At fp16, the model’s weights occupy approximately 350 MB, leaving ample room for audio preprocessing and intermediate activations even on devices with 1 GB total system memory.
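The weights-only figure follows from parameter count times bytes per parameter. A rough sketch, assuming 182 million parameters and ignoring any per-layer metadata or quantization scales (the helper `weight_megabytes` is hypothetical):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_megabytes(num_params: int, precision: str) -> float:
    """Approximate size of the weights alone, in megabytes (1 MB = 10^6 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e6


NUM_PARAMS = 182_000_000  # 0.182B, as listed in the specifications

for precision in ("fp32", "fp16", "int8", "int4"):
    print(f"{precision}: {weight_megabytes(NUM_PARAMS, precision):.0f} MB")
```

The fp16 result, 364 MB, lines up with the ~350 MB quoted above. The fp32 result, 728 MB, is close to the 737 MB download size listed in the specifications, which would be consistent with a 32-bit checkpoint, though the page does not state the stored precision.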

The model was trained on a diverse mix of datasets including LibriSpeech, Common Voice, VoxPopuli, EuroParl, Fisher, Switchboard, and the People’s Speech corpus. This broad training set contributes to robustness across accents, recording conditions, and speaking styles.

Capabilities & Use Cases

Canary 180M Flash supports two primary tasks:

Automatic Speech Recognition (ASR) – transcribe audio in any of the four supported languages.
Automatic Speech Translation (AST) – translate speech between English and each of the other three languages, in either direction (e.g., English audio → German text, German audio → English text).

Benchmarks (from the official Hugging Face model card):

Task	Dataset	Metric	Score
ASR (English)	LibriSpeech test-other	WER	2.87%
ASR (English)	Common Voice 16.1 (en)	WER	6.99%
ASR (German)	Common Voice 16.1 (de)	WER	4.03%
ASR (Spanish)	Common Voice 16.1 (es)	WER	3.31%
ASR (French)	Common Voice 16.1 (fr)	WER	5.88%
AST (En→De)	FLEURS	BLEU	32.27
AST (En→Es)	FLEURS	BLEU	22.60
AST (En→Fr)	FLEURS	BLEU	41.22
AST (De→En)	FLEURS	BLEU	35.50
AST (Fr→En)	FLEURS	BLEU	33.42

These are competitive numbers for a model of this size. The German ASR WER of 4.03% on Common Voice, for example, is within striking distance of much larger models.
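For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain textbook implementation, not the scoring script behind the numbers above, which may also apply text normalization before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```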

Real-world use cases:

Real-time captioning on mobile – live subtitling for video calls, lectures, or meetings, with no cloud round-trip.
Voice-controlled edge devices – smart speakers, headphones, or automotive assistants that must process local commands with sub-200 ms latency.
Offline translation – travel or field applications where network access is unavailable or expensive.
Automated transcription pipelines – batch processing of audio files on low-cost hardware (e.g., a Raspberry Pi cluster or a secondary GPU in a home server).

The model does not handle speaker diarization or emotion recognition out of the box. It is a pure transcription/translation engine.

Running NVIDIA Canary 180M Flash Locally

This is where Canary 180M Flash shines: it runs on hardware you already own, often with no dedicated GPU required.

VRAM Requirements

Quantization is the main knob for adjusting memory usage.

Precision	VRAM (weights + overhead)	Notes
fp16 (default)	~500 MB	Recommended for best accuracy on GPU
int8	~280 MB	Good trade-off; slight WER increase (~0.3–0.5%)
int4	~180 MB	Lowest footprint; suitable for CPU/NPU

Most users will want int8 quantization for the best balance of speed, accuracy, and memory footprint.
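To make the accuracy trade-off concrete, here is a generic sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration of the general technique only; the page does not say which quantization scheme any particular runtime applies to this model, and the toy weights are made up.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Toy weights (illustrative values, not taken from the model).
weights = [0.12, -0.5, 0.33, 0.0, 0.49]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"max error {max_error:.5f}, half step {scale / 2:.5f}")
```

Each weight now fits in one byte instead of two, at the cost of a small rounding error per value; those errors are what show up as the slight WER increase noted in the table.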

Hardware Compatibility

Any consumer GPU: RTX 3060 (12 GB) can run the model in fp16 with batch size 16. RTX 4090 is overkill but will push inference to sub-10 ms per audio minute.
Integrated GPUs: Intel Iris Xe, AMD RDNA 3 iGPUs – runs comfortably at int8.
Apple Silicon: M1, M2, M3, M4 – runs natively on the Neural Engine or GPU. M4 Max sees real-time factors above 1000x.
CPU only: A modern x86 CPU (e.g., AMD Ryzen 5 or Intel Core i5 with AVX2) can achieve 50–100x RTFx using int4 quantization and the ONNX runtime.
NPUs: Qualcomm Hexagon, Apple Neural Engine, and other mobile NPUs are ideal targets due to the model’s small size.

Expected Performance (tokens per second)

Because the model outputs text tokens at a rate tied to audio duration (typically ~10 tokens per second of speech), the more useful metric is audio processing speed. On an RTX 4060 at int8, expect:

~800x RTFx (process 10 minutes of audio in <1 second)
Approx. 1.25 ms of compute per second of audio

On an M2 MacBook Air (Neural Engine, int8):

~600x RTFx

On a Raspberry Pi 5 (CPU, int4):

~40x RTFx – still real-time for most use cases.
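Figures like these vary with drivers, batch size, and audio length, so it is worth measuring on your own hardware. A minimal timing sketch follows; `measure_rtfx` and `fake_transcribe` are hypothetical names, and the stand-in function should be replaced with a call into your actual inference code.

```python
import time


def measure_rtfx(transcribe_fn, audio_seconds: float, runs: int = 3) -> float:
    """Audio duration divided by the best wall-clock time over several runs."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        transcribe_fn()
        best = min(best, time.perf_counter() - start)
    return audio_seconds / best


def fake_transcribe():
    """Stand-in for a real inference call; sleeps for about 10 ms."""
    time.sleep(0.01)


rtfx = measure_rtfx(fake_transcribe, audio_seconds=10.0)
print(f"{rtfx:.0f}x real time")
```

Taking the best of several runs reduces the influence of one-off costs such as model warm-up and file caching; report the median instead if you care about typical rather than peak throughput.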

Quick Start

The fastest way to experiment is through NVIDIA’s NeMo framework. Install via pip:

pip install nemo_toolkit[asr]

Then load and transcribe:

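import nemo.collections.asr as nemo_asr

# Download the checkpoint from Hugging Face and build the multitask model
model = nemo_asr.models.EncDecMultiTaskModel.from_pretrained("nvidia/canary-180m-flash")
# transcribe() takes a list of audio paths and returns one result per file
transcript = model.transcribe(["audio.wav"])[0]
print(transcript)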
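The snippet above does not say which language pair to use. Canary models are commonly driven by a JSON-lines manifest in which each entry names the audio file and the source and target languages. The sketch below only writes such a file with the standard library; the field names (`audio_filepath`, `source_lang`, `target_lang`, `pnc`) are an assumption based on the manifest format described for Canary models and should be checked against the current model card before use.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(entries, path):
    """Write one JSON object per line (JSON-lines format)."""
    with open(path, "w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry) + "\n")


# Assumed field names; verify against the model card.
entries = [
    # Same source and target language: plain transcription (ASR).
    {"audio_filepath": "audio.wav", "source_lang": "en", "target_lang": "en", "pnc": "yes"},
    # Different target language: speech translation (AST), English to German.
    {"audio_filepath": "audio.wav", "source_lang": "en", "target_lang": "de", "pnc": "yes"},
]

manifest_path = Path(tempfile.gettempdir()) / "canary_manifest.jsonl"
write_manifest(entries, manifest_path)
print(manifest_path.read_text(encoding="utf-8"))
```

Whether `transcribe` accepts the manifest path directly, or takes the languages as keyword arguments instead, depends on the NeMo version, so treat this as a starting point rather than a confirmed interface.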

For CPU-only or quantized inference, convert the model to ONNX or use the torch.inference_mode path with torch.quantization.

How It Compares

Two common alternatives are OpenAI Whisper Small (244M), which is close in size, and Meta SeamlessM4T-Medium (1.2B), which is several times larger. The table below compares them:

Aspect	Canary 180M Flash	Whisper Small	SeamlessM4T-Medium
Parameters	182M	244M	1.2B
Languages (ASR)	4	99	101
Translation	En↔De, En↔Es, En↔Fr	English only (from any)	~100 pairs
Speed (RTFx on RTX 4090)	>1200x	~500x	~200x
Memory (fp16)	350 MB	500 MB	2.4 GB
License	CC-BY-4.0	MIT	CC-BY-NC 4.0
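The speed and memory rows can be turned into ratios relative to Canary 180M Flash. This sketch treats the table's approximate values (">1200x", "~500x", "~200x") as exact, so the results are rough; the helper name `relative_to_canary` is illustrative.

```python
# Speed and fp16 memory figures copied from the comparison table.
# ">" and "~" qualifiers are dropped, so the ratios are approximate.
table = {
    "Canary 180M Flash": {"rtfx": 1200, "memory_mb": 350},
    "Whisper Small": {"rtfx": 500, "memory_mb": 500},
    "SeamlessM4T-Medium": {"rtfx": 200, "memory_mb": 2400},
}


def relative_to_canary(name):
    """Return (slowdown, memory ratio) of a model versus Canary 180M Flash."""
    baseline = table["Canary 180M Flash"]
    row = table[name]
    return baseline["rtfx"] / row["rtfx"], row["memory_mb"] / baseline["memory_mb"]


for name in table:
    slowdown, memory_ratio = relative_to_canary(name)
    print(f"{name}: {slowdown:.1f}x slower, {memory_ratio:.2f}x memory")
```

On these figures Whisper Small is about 2.4x slower and needs roughly 1.4x the memory, while SeamlessM4T-Medium is about 6x slower and needs nearly 7x the memory.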

When to choose Canary 180M Flash:

You only need English + German + Spanish + French.
Latency is critical – mobile or edge applications require sub-50 ms processing.
You have severe memory constraints (e.g., embedded device with 1 GB RAM).
You want a permissive CC-BY-4.0 license for commercial use.

When to choose Whisper Small:

You need broad language support (99 languages).
Translation from any language to English is sufficient.
You are willing to accept 2x higher latency and 40% larger memory.

When to choose SeamlessM4T-Medium:

You need full bidirectional speech-to-speech translation.
You have a dedicated GPU and the latency trade-off is acceptable.
Non-commercial research use is fine (CC-BY-NC).

For the specific niche of offline, low-power, multilingual ASR with translation for four Western European languages, NVIDIA Canary 180M Flash is the most efficient option available today.

