
NVIDIA Canary 180M Flash is a compact 182M-parameter multilingual encoder-decoder ASR and translation model supporting 4 languages with >1200 RTFx inference speed, designed for mobile and edge deployment.
A solid 0.182B-parameter dense audio model from NVIDIA. Treat the per-modality benchmarks as the leading indicator of fit; composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA L4 · Vast.ai · Spot · 24 GB VRAM | $0.03 |
| NVIDIA L4 · Vast.ai · On-Demand · 24 GB VRAM | $0.03 |
| NVIDIA GeForce RTX 5070 Ti · Vast.ai · Spot · 16 GB VRAM | $0.11 |
| NVIDIA GeForce RTX 3070 · RunPod · Community · 8 GB VRAM | $0.13 |
| NVIDIA GeForce RTX 3070 · RunPod · Spot · 8 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
NVIDIA Canary 180M Flash is a compact, multilingual automatic speech recognition (ASR) and speech-to-text translation model developed by NVIDIA. At only 0.182 billion parameters, it is designed explicitly for mobile and edge deployment where latency, power, and memory budgets are tight. Unlike larger ASR models that require datacenter GPUs or cloud APIs, Canary 180M Flash fits on a smartphone SoC, a Raspberry Pi 5, or the NPU of a laptop – while still delivering production-quality transcription and translation.
The model is a dense encoder-decoder architecture, not a mixture-of-experts. Every parameter is active during inference, which simplifies deployment and guarantees predictable memory usage. NVIDIA reports inference speeds exceeding 1200x real-time factor (RTFx) – meaning the model processes 1200 seconds of audio per second of compute. For a one-minute audio clip, inference completes in roughly 50 milliseconds.
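The RTFx arithmetic above can be checked in a few lines of Python. This is a minimal sketch: the helper name `processing_time_seconds` is hypothetical, and the 1200x figure is NVIDIA's reported throughput, not a guarantee for any particular device.

```python
def processing_time_seconds(audio_seconds: float, rtfx: float) -> float:
    """Return the compute time needed for a clip at a given real-time factor."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


# A one-minute clip at the reported 1200x RTFx: 60 / 1200 = 0.05 s.
one_minute = processing_time_seconds(60.0, 1200.0)
print(f"{one_minute * 1000:.0f} ms")
```

The same helper gives about 3 seconds of compute for an hour of audio at that rate.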
Canary 180M Flash competes directly with other small-footprint ASR models like OpenAI Whisper Small (244M parameters) and Meta’s SeamlessM4T-Medium (1.2B parameters). Its key differentiator is the combination of size, speed, and native support for four languages (English, German, Spanish, French) in both transcription and translation tasks.
Canary 180M Flash uses a FastConformer encoder paired with a Transformer decoder. The FastConformer variant is a streamlined version of the Conformer architecture that reduces computational overhead by merging consecutive time steps and using a simplified attention mechanism. This is what enables the extreme real-time factor on low-power hardware.
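To see why merging time steps matters, the sketch below estimates how many steps the encoder attends over for a given clip. The 10 ms feature hop and the 8x subsampling factor are assumptions (values commonly associated with FastConformer, not stated in this article), and `encoder_steps` is a hypothetical helper.

```python
def encoder_steps(audio_seconds: float, hop_ms: int = 10, subsampling: int = 8) -> int:
    """Estimate how many time steps the encoder attends over.

    hop_ms and subsampling are assumed values, not taken from this article.
    """
    audio_ms = round(audio_seconds * 1000)
    feature_frames = audio_ms // hop_ms
    # Ceiling division: a partial group of frames still yields one step.
    return -(-feature_frames // subsampling)


frames = round(60.0 * 1000) // 10
steps = encoder_steps(60.0)
print(frames, steps)  # 6000 750
```

Because self-attention cost grows roughly with the square of the sequence length, an 8x shorter sequence means about 64x fewer attention pairs under these assumptions.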
Because it is a dense model, there is no sparse activation or expert routing to manage. VRAM usage scales linearly with precision and batch size. At fp16, the model’s weights occupy approximately 350 MB, leaving ample room for audio preprocessing and intermediate activations even on devices with 1 GB total system memory.
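The weight footprint follows directly from parameter count times bytes per parameter. The sketch below works that out for the 182M parameters quoted for this model. It covers raw weights only, and the helper name `weight_memory_mb` is hypothetical.

```python
BITS_PER_PARAM = {"fp16": 16, "int8": 8, "int4": 4}


def weight_memory_mb(num_params: int, precision: str) -> float:
    """Return the raw weight size in megabytes (10**6 bytes)."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e6


NUM_PARAMS = 182_000_000  # parameter count quoted for Canary 180M Flash

for precision in BITS_PER_PARAM:
    print(f"{precision}: {weight_memory_mb(NUM_PARAMS, precision):.0f} MB")
```

At fp16 this gives 364 MB (about 347 MiB), in line with the approximate 350 MB figure. Activations, audio features, and runtime buffers come on top of that.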
The model was trained on a diverse mix of datasets including LibriSpeech, Common Voice, VoxPopuli, EuroParl, Fisher, Switchboard, and the People’s Speech corpus. This broad training set contributes to robustness across accents, recording conditions, and speaking styles.
Canary 180M Flash supports two primary tasks:
Benchmarks (from the official Hugging Face model card):
| Task | Dataset | Metric | Score |
|---|---|---|---|
| ASR (English) | LibriSpeech test-other | WER | 2.87% |
| ASR (English) | Common Voice 16.1 (en) | WER | 6.99% |
| ASR (German) | Common Voice 16.1 (de) | WER | 4.03% |
| ASR (Spanish) | Common Voice 16.1 (es) | WER | 3.31% |
| ASR (French) | Common Voice 16.1 (fr) | WER | 5.88% |
| AST (En→De) | FLEURS | BLEU | 32.27 |
| AST (En→Es) | FLEURS | BLEU | 22.60 |
| AST (En→Fr) | FLEURS | BLEU | 41.22 |
| AST (De→En) | FLEURS | BLEU | 35.50 |
| AST (Fr→En) | FLEURS | BLEU | 33.42 |
These are competitive numbers for a model of this size. The German ASR WER of 4.03% on Common Voice, for example, is within striking distance of much larger models.
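For readers unfamiliar with the metric, WER is conventionally the word-level edit distance between the reference and the model output, divided by the number of reference words. The sketch below is a minimal illustration with a hypothetical `word_error_rate` helper. It skips the text normalization (casing, punctuation) that published benchmarks typically apply, so it will not reproduce the table's numbers exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(f"{word_error_rate('the cat sat on the mat', 'the cat sat on mat'):.2%}")
```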
Real-world use cases:
The model does not handle speaker diarization or emotion recognition out of the box. It is a pure transcription/translation engine.
This is where Canary 180M Flash shines: it runs on hardware you already own, often with no dedicated GPU required.
Quantization is the main knob for adjusting memory usage.
| Precision | VRAM (weights + overhead) | Notes |
|---|---|---|
| fp16 (default) | ~500 MB | Recommended for best accuracy on GPU |
| int8 | ~280 MB | Good trade-off; slight WER increase (~0.3–0.5%) |
| int4 | ~180 MB | Lowest footprint; suitable for CPU/NPU |
Most users will want int8 quantization for the best balance of speed, accuracy, and memory footprint.
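To make the accuracy trade-off concrete, the toy example below applies symmetric per-tensor int8 quantization to a handful of values and measures the round-trip error. It is an illustration of the general idea only. It is not necessarily the scheme any particular runtime uses for this model, and the function names and sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.8, -1.0, 0.33, 0.0, 0.05]  # illustrative values only
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error {max_error:.4f}")
```

Each value is stored in one byte instead of two, and the worst-case rounding error is about half the scale. Small per-weight errors of this kind are what accumulate into a slight WER increase.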
Because the model outputs text tokens at a rate tied to audio duration (typically ~10 tokens per second of speech), the more useful metric is audio processing speed. On an RTX 4060 at int8, expect:
On an M2 MacBook Air (Neural Engine, int8):
On a Raspberry Pi 5 (CPU, int4):
The fastest way to experiment is through NVIDIA’s NeMo framework. Install via pip:
```bash
pip install nemo_toolkit[asr]
```
Then load and transcribe:
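```python
import nemo.collections.asr as nemo_asr

# Load the pretrained checkpoint (downloaded on first use).
model = nemo_asr.models.EncDecMultiTaskModel.from_pretrained("nvidia/canary-180m-flash")
# One result per input file; recent NeMo releases return an object with `.text`.
transcript = model.transcribe(["audio.wav"])[0]
print(getattr(transcript, "text", transcript))
```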
For CPU-only or quantized inference, convert the model to ONNX, or run it under `torch.inference_mode()` and apply PyTorch's quantization tooling (`torch.quantization`).
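The transcription call shown earlier takes a plain list of audio paths. For other languages or for translation, NeMo's Canary models are usually driven through a JSON-lines manifest. The sketch below writes such a manifest. The field names are assumed to follow the layout described on NVIDIA's Canary model cards and should be verified against your NeMo version. The file names and the `manifest_line` helper are hypothetical.

```python
import json


def manifest_line(audio_path, source_lang, target_lang, pnc="yes"):
    """Build one JSON line describing a single audio file and task."""
    return json.dumps({
        "audio_filepath": audio_path,
        "source_lang": source_lang,  # language spoken in the audio
        "target_lang": target_lang,  # same as source_lang for plain ASR
        "pnc": pnc,                  # punctuation and capitalization
    })


lines = [
    manifest_line("audio_de.wav", "de", "de"),  # German transcription
    manifest_line("audio_en.wav", "en", "fr"),  # English speech to French text
]
with open("input_manifest.json", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")

# With NeMo installed and `model` loaded as shown earlier (assumed call shape):
# results = model.transcribe("input_manifest.json", batch_size=16)
```

Setting the source and target language to the same code requests transcription, while a differing pair requests translation.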
Two direct alternatives are OpenAI Whisper Small (244M), which is close in size, and Meta SeamlessM4T-Medium (1.2B), which is more than six times larger. Here is an honest assessment:
| Aspect | Canary 180M Flash | Whisper Small | SeamlessM4T-Medium |
|---|---|---|---|
| Parameters | 182M | 244M | 1.2B |
| Languages (ASR) | 4 | 99 | 101 |
| Translation | En↔De, En↔Es, En↔Fr | Into English only (from any supported language) | ~100 pairs |
| Speed (RTFx on RTX 4090) | >1200x | ~500x | ~200x |
| Memory (fp16 weights) | ~350 MB | ~500 MB | ~2.4 GB |
| License | CC-BY-4.0 | MIT | CC-BY-NC 4.0 |
When to choose Canary 180M Flash:
When to choose Whisper Small:
When to choose SeamlessM4T-Medium:
For the specific niche of offline, low-power, multilingual ASR with translation for four Western European languages, NVIDIA Canary 180M Flash is the most efficient option available today.
