NVIDIA Canary 1B Flash is an 883M-parameter multilingual encoder-decoder model for ASR and speech translation. It is a faster variant of Canary-1B that supports 4 languages and runs at more than 1000 RTFx.
A strong 883M-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit; composite scoring across modalities is still maturing.
Canary-1B-Flash is part of the NVIDIA NeMo Canary Flash family, a faster and more accurate variant of Canary-1B. With 883 million parameters, it runs at more than 1000 RTFx on the open-asr-leaderboard datasets. It supports ASR in 4 languages (English, German, French, Spanish) and translation in both directions between English and each of German, French, and Spanish, with optional punctuation and capitalization (PnC). It also offers experimental word-level and segment-level timestamps.
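RTFx is the inverse real-time factor: seconds of audio processed per second of wall-clock time, so 1000 RTFx corresponds to roughly 1000 seconds of audio transcribed per second of compute. The arithmetic can be sketched as follows; the function name `rtfx` and the sample numbers are illustrative assumptions, not measurements of this model.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: audio duration divided by processing time."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


# Illustrative numbers only: one hour of audio processed in 3 seconds.
print(rtfx(3600.0, 3.0))  # 1200.0
```

Higher is faster. Measured values depend on hardware, batch size, and the audio being transcribed.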
Architecture: Encoder-decoder model with a FastConformer encoder (32 layers) and a Transformer decoder (4 layers), totaling 883M parameters. Task tokens like <target language>, <task>, <toggle timestamps>, <toggle PnC> prompt the decoder. Uses a concatenated SentencePiece tokenizer.
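One way to picture the prompt is as four slots, one per task token, filled from the request. The sketch below is a hypothetical illustration and not the NeMo API: the class `DecoderPrompt`, the language codes, and the slot values (`"asr"`, `"translation"`, `"on"`, `"off"`) are placeholder assumptions, not the model's actual token strings. It also encodes the language constraint from the description: translation must have English on one side.

```python
from dataclasses import dataclass

# Assumed two-letter codes for English, German, French, Spanish.
SUPPORTED = {"en", "de", "fr", "es"}


@dataclass(frozen=True)
class DecoderPrompt:
    """Hypothetical container for the four prompt slots named above."""
    source_lang: str
    target_lang: str
    timestamps: bool = False
    pnc: bool = True

    def task(self) -> str:
        if self.source_lang not in SUPPORTED or self.target_lang not in SUPPORTED:
            raise ValueError("unsupported language")
        if self.source_lang == self.target_lang:
            return "asr"  # same language in and out: transcription
        if "en" in (self.source_lang, self.target_lang):
            return "translation"  # English on one side
        raise ValueError("translation is only supported to or from English")

    def slots(self) -> dict:
        # Placeholder values; the real tokenizer defines its own special tokens.
        return {
            "<target language>": self.target_lang,
            "<task>": self.task(),
            "<toggle timestamps>": "on" if self.timestamps else "off",
            "<toggle PnC>": "on" if self.pnc else "off",
        }


print(DecoderPrompt("en", "de", timestamps=True).slots())
```

Because the task is selected through the prompt, the same weights serve both transcription and translation. Only the slot values change between requests.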
Training: Trained using the NVIDIA NeMo Framework for 200K steps with 2D bucketing and OOMptimizer on 128 NVIDIA A100 80GB GPUs.
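2D bucketing groups training examples by two lengths at once, the input audio duration and the output token count, so that each batch contains similarly sized examples and wastes less padding. The sketch below is a simplified illustration of that idea, not NeMo's implementation; the bin edges and the function `bucket_2d` are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical upper bounds (inclusive) for each bucket dimension.
DURATION_EDGES = [5.0, 10.0, 20.0, 40.0]  # seconds of audio
TOKEN_EDGES = [16, 64, 256]               # target tokens


def bucket_2d(duration_s: float, num_tokens: int):
    """Return (duration_bin, token_bin), or None if the example fits no bucket."""
    d = bisect_left(DURATION_EDGES, duration_s)
    t = bisect_left(TOKEN_EDGES, num_tokens)
    if d == len(DURATION_EDGES) or t == len(TOKEN_EDGES):
        return None  # longer than the largest bucket; this sketch drops it
    return (d, t)


# Toy examples as (duration in seconds, token count).
examples = [(3.2, 12), (3.9, 40), (12.0, 50), (55.0, 30)]
buckets = defaultdict(list)
for ex in examples:
    key = bucket_2d(*ex)
    if key is not None:
        buckets[key].append(ex)

print(dict(buckets))
```

Each bucket can then be given its own batch size, since short examples leave room for more items per batch than long ones. Choosing those per-bucket batch sizes so that training does not run out of GPU memory is the kind of tuning a tool such as OOMptimizer automates; that step is not shown here.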
Use cases: