NVIDIA Canary 1B v2 is a scaled-up multilingual speech recognition and translation model supporting 25 European languages, with state-of-the-art accuracy and up to 10× faster inference than comparable models.
Canary-1b-v2 is a scaled and enhanced member of the Canary family with 978 million parameters. It supports 25 European languages, up from 4 in canary-1b and canary-1b-flash. It is the first NeMo model to leverage the full NVIDIA Granary dataset plus NeMo ASR Set 3.0, and it handles both multitask (ASR and speech-to-text translation) and multilingual use. It offers quality comparable to models 3× larger while running up to 10× faster.
Architecture: Encoder-decoder with FastConformer encoder (32 layers) and Transformer decoder (8 layers), 978M parameters. Uses a unified SentencePiece tokenizer with a vocabulary of 16,384 tokens optimized across all 25 supported languages.
Training: Trained on the Granary dataset (improved pseudo-labels and filtered corpora) combined with NeMo ASR Set 3.0 human-labeled data. All transcripts include punctuation and capitalization.
Languages: Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, Swedish, Russian, Ukrainian.
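As a rough illustration of how the language list might be used in client code, the sketch below maps the 25 languages to their ISO 639-1 codes and classifies a request as ASR (same source and target language) or translation (different languages). The names `SUPPORTED_LANGUAGES` and `classify_request` are hypothetical and not part of NeMo, and the helper does not check which translation directions the model actually supports, since that is not stated here.

```python
# ISO 639-1 codes for the 25 languages listed above.
# The table and helper are illustrative; neither is a NeMo API.
SUPPORTED_LANGUAGES = {
    "bg": "Bulgarian", "hr": "Croatian", "cs": "Czech", "da": "Danish",
    "nl": "Dutch", "en": "English", "et": "Estonian", "fi": "Finnish",
    "fr": "French", "de": "German", "el": "Greek", "hu": "Hungarian",
    "it": "Italian", "lv": "Latvian", "lt": "Lithuanian", "mt": "Maltese",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "sk": "Slovak",
    "sl": "Slovenian", "es": "Spanish", "sv": "Swedish", "ru": "Russian",
    "uk": "Ukrainian",
}


def classify_request(source_lang: str, target_lang: str) -> str:
    """Return 'asr' when source and target match, otherwise 'translation'.

    Raises ValueError if either code is not in the supported list.
    Assumption: supported translation directions are NOT validated here.
    """
    for code in (source_lang, target_lang):
        if code not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language code: {code!r}")
    return "asr" if source_lang == target_lang else "translation"
```

For example, `classify_request("de", "de")` describes German transcription, while `classify_request("de", "en")` describes German-to-English speech translation.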
Features: Automatic punctuation and capitalization, word-level and segment-level timestamps, dynamic chunking for long-form transcription, and robust performance on noisy audio. Topped the Hugging Face multilingual open-ASR leaderboard at release.
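To give a feel for what chunked long-form transcription involves, here is a minimal sketch that splits a recording into fixed-length, slightly overlapping windows. This is a simpler stand-in, not the model's dynamic chunking, whose boundary selection is not described here. The function name `plan_chunks` and the default chunk and overlap lengths are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def plan_chunks(
    total_s: float,
    chunk_s: float = 40.0,   # illustrative default, not a documented value
    overlap_s: float = 1.0,  # illustrative default, not a documented value
) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that cover [0, total_s].

    Consecutive windows overlap by overlap_s so that words cut at a
    boundary appear whole in at least one window.
    """
    if chunk_s <= 0 or not 0 <= overlap_s < chunk_s:
        raise ValueError("need chunk_s > 0 and 0 <= overlap_s < chunk_s")
    if total_s <= 0:
        return []

    step = chunk_s - overlap_s  # how far each window start advances
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while True:
        end = min(start + chunk_s, total_s)
        chunks.append((start, end))
        if end >= total_s:
            break
        start += step
    return chunks
```

Each window would then be transcribed separately and the overlapping regions merged, for instance by aligning on word timestamps. A dynamic scheme would instead adapt the boundaries to the audio rather than use a fixed step.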