NVIDIA Canary 180M Flash is a compact, 182M-parameter multilingual encoder-decoder model for speech recognition (ASR) and speech translation. It supports 4 languages, runs inference at more than 1200 RTFx, and is designed for mobile and edge deployment.
A strong 0.182B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit, since composite scoring across modalities is still maturing.
Canary-180M-Flash is the smallest member of the NVIDIA NeMo Canary Flash family, with 182 million parameters and inference speed of more than 1200 RTFx on open-asr-leaderboard datasets. It supports ASR in 4 languages (English, German, French, Spanish) and bidirectional translation between English and the other three languages, with optional punctuation and capitalization (PnC). It also offers experimental word-level and segment-level timestamps.
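To make the figures in the paragraph above concrete, here is a minimal sketch in plain Python. It assumes RTFx is read in the usual way, as seconds of audio processed per second of compute, and it encodes the supported language pairs exactly as listed above. The function names (`rtfx`, `processing_seconds`, `task_for`) are hypothetical helpers written for this illustration and are not part of NeMo or the model's API.

```python
# Illustrative helpers only; these names are not part of NeMo.

# Languages listed for ASR: English, German, French, Spanish.
SUPPORTED_LANGS = {"en", "de", "fr", "es"}


def rtfx(audio_seconds: float, compute_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def processing_seconds(audio_seconds: float, rtfx_value: float) -> float:
    """Estimated processing time for a clip at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_seconds / rtfx_value


def task_for(source_lang: str, target_lang: str) -> str:
    """Classify a language pair as 'asr' or 'translation'.

    Same language on both sides is ASR. Translation is only listed
    between English and German, French, or Spanish (either direction),
    so a pair such as German to French is rejected.
    """
    if source_lang not in SUPPORTED_LANGS or target_lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported language: {source_lang}->{target_lang}")
    if source_lang == target_lang:
        return "asr"
    if "en" in (source_lang, target_lang):
        return "translation"
    raise ValueError(f"unsupported translation pair: {source_lang}->{target_lang}")


if __name__ == "__main__":
    # One hour of audio at 1200 RTFx.
    print(processing_seconds(3600.0, 1200.0), "seconds")
    print(task_for("en", "en"), task_for("de", "en"), task_for("en", "es"))
```

At 1200 RTFx, one hour of audio corresponds to about three seconds of processing. The leaderboard figure reflects a specific hardware and batching setup, so on-device throughput should be expected to differ.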
Architecture: Encoder-decoder with FastConformer encoder and Transformer decoder, based on the Canary Flash architecture. Uses a concatenated SentencePiece tokenizer.
Training: Trained using the NVIDIA NeMo framework for 219K steps with 2D bucketing and OOMptimizer on 32 NVIDIA A100 80GB GPUs.
Use cases: On-device speech recognition and translation (e.g., on smartphones), real-time translation earbuds, low-latency voice assistants, and applications that require privacy or offline operation. Released under CC-BY-4.0, which permits commercial use.