NVIDIA Canary 1B v2 is a multilingual speech recognition and translation model supporting 25 European languages, with quality comparable to models three times its size and up to 10× faster inference.
Canary 1B v2 is a strong 0.978B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit; composite scoring across modalities is still maturing.
Canary-1b-v2 is a scaled-up, enhanced member of the Canary family with 978 million parameters and support for 25 European languages (up from 4 in canary-1b and canary-1b-flash). It is the first NeMo model trained on the full NVIDIA Granary dataset together with NeMo ASR Set 3.0, and it handles both multitask (ASR and speech-to-text translation) and multilingual use. It offers quality comparable to models 3× larger while running up to 10× faster.
Architecture: Encoder-decoder with FastConformer encoder (32 layers) and Transformer decoder (8 layers), 978M parameters. Uses a unified SentencePiece tokenizer with a vocabulary of 16,384 tokens optimized across all 25 supported languages.
Training: Trained on the Granary dataset (improved pseudo-labels and filtered corpora) combined with NeMo ASR Set 3.0 human-labeled data. All transcripts include punctuation and capitalization.
Languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, Ukrainian.
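The language list above can be turned into a small lookup that checks a request before it reaches the model. The sketch below maps the 25 languages to their ISO 639-1 codes and decides whether a source/target pair is plain transcription or translation. The helper name `resolve_task` is hypothetical, and the rule that translation always has English on one side is an assumption not stated in this description; check it against the upstream model documentation.

```python
# ISO 639-1 codes for the 25 languages listed above.
LANGUAGES = {
    "bg": "Bulgarian", "hr": "Croatian", "cs": "Czech", "da": "Danish",
    "nl": "Dutch", "en": "English", "et": "Estonian", "fi": "Finnish",
    "fr": "French", "de": "German", "el": "Greek", "hu": "Hungarian",
    "it": "Italian", "lv": "Latvian", "lt": "Lithuanian", "mt": "Maltese",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "sk": "Slovak",
    "sl": "Slovenian", "es": "Spanish", "sv": "Swedish", "ru": "Russian",
    "uk": "Ukrainian",
}


def resolve_task(source_lang: str, target_lang: str) -> str:
    """Return 'asr' or 'translation' for a language pair (hypothetical helper).

    Raises ValueError for unknown codes or for pairs this sketch
    assumes are unsupported.
    """
    for code in (source_lang, target_lang):
        if code not in LANGUAGES:
            raise ValueError(f"unsupported language code: {code!r}")
    if source_lang == target_lang:
        # Same language in and out means plain transcription.
        return "asr"
    # ASSUMPTION: translation pairs always include English on one side.
    if "en" in (source_lang, target_lang):
        return "translation"
    raise ValueError(
        f"translation {source_lang}->{target_lang} is assumed unsupported "
        "(no English side)"
    )
```

For example, `resolve_task("de", "de")` returns `"asr"` and `resolve_task("en", "fr")` returns `"translation"`, while `resolve_task("de", "fr")` raises `ValueError` under the stated assumption.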
Features: Automatic punctuation and capitalization, word- and segment-level timestamps, dynamic chunking for long-form transcription, robustness to noise. Topped the Hugging Face multilingual open-ASR leaderboard at release.
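The timestamp feature listed above might be used along the following lines. This is a minimal sketch, not an authoritative recipe: it assumes the NeMo toolkit is installed, that the model is published under the Hugging Face id `nvidia/canary-1b-v2`, and that each result exposes segment timestamps as dicts with `start`, `end`, and `segment` keys. None of these details appear in this description, so verify them against the installed NeMo version. The function names `format_segments` and `transcribe_with_timestamps` are illustrative.

```python
def format_segments(segments):
    """Render segment timestamp dicts as 'start - end : text' lines.

    ASSUMPTION: each dict carries 'start', 'end' (seconds) and 'segment' (text).
    """
    return [
        f"{seg['start']:.2f}s - {seg['end']:.2f}s : {seg['segment']}"
        for seg in segments
    ]


def transcribe_with_timestamps(audio_paths, source_lang="en", target_lang="en"):
    """Transcribe (or translate) audio files and return (text, segments) pairs."""
    # Imported lazily so format_segments works without NeMo installed.
    from nemo.collections.asr.models import ASRModel

    # ASSUMPTION: model id on Hugging Face; weights download on first use.
    model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2")
    outputs = model.transcribe(
        audio_paths,
        source_lang=source_lang,
        target_lang=target_lang,
        timestamps=True,
    )
    # ASSUMPTION: each result has .text and a .timestamp dict with a 'segment' list.
    return [(out.text, out.timestamp["segment"]) for out in outputs]
```

A call such as `transcribe_with_timestamps(["sample.wav"])` (with `sample.wav` standing in for a real recording) would return one `(text, segments)` pair per file, and `format_segments(segments)` turns the second element into printable lines. Setting `source_lang` and `target_lang` to different codes requests translation instead of transcription.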