NVIDIA NeMo Canary 1B is a 1-billion-parameter multilingual encoder-decoder ASR and speech translation model supporting English, German, French, and Spanish.
Canary-1B is a multilingual, multitask encoder-decoder speech model from the NVIDIA NeMo team. It supports automatic speech recognition (ASR) in four languages (English, German, French, Spanish) and translation in both directions between English and each of the other three languages. Punctuation and capitalization (PnC) in the output is optional.
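The supported task and language combinations described above can be written down as a small check. This is a minimal sketch based only on the description here, not part of NeMo; the function name `is_supported`, the task labels `"asr"` and `"translation"`, and the two-letter language codes are assumptions made for illustration.

```python
# Hypothetical helper: encodes the task/language combinations described above.
# Language codes and task labels are illustrative assumptions, not NeMo constants.
SUPPORTED_LANGS = {"en", "de", "fr", "es"}


def is_supported(task: str, source_lang: str, target_lang: str) -> bool:
    """Return True if the (task, source, target) combination matches the description."""
    if source_lang not in SUPPORTED_LANGS or target_lang not in SUPPORTED_LANGS:
        return False
    if task == "asr":
        # Transcription: the output language is the spoken language.
        return source_lang == target_lang
    if task == "translation":
        # Translation: exactly one side must be English (en<->de, en<->fr, en<->es).
        return (source_lang == "en") != (target_lang == "en")
    return False


if __name__ == "__main__":
    print(is_supported("asr", "de", "de"))          # German transcription
    print(is_supported("translation", "en", "fr"))  # English to French
    print(is_supported("translation", "de", "fr"))  # no English side
```

Under these assumptions, German-to-French translation is rejected because neither side is English.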
Architecture: Encoder-decoder with a FastConformer encoder (24 layers) and a Transformer decoder (24 layers). The encoder extracts audio features, which are fed into the Transformer decoder together with task tokens (<source language>, <target language>, <task>, <toggle PnC>) that trigger autoregressive text generation. The model uses a concatenated SentencePiece tokenizer that combines per-language tokenizers.
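To make the role of the task tokens concrete, here is a toy sketch of how a prompt of task tokens can seed a greedy autoregressive decoding loop. It does not call NeMo or the real model: the token spellings (such as `<pnc>` and `<eos>`), the function names, and the `next_token` callback standing in for the decoder are all hypothetical, and the token order simply follows the list in the paragraph above.

```python
from typing import Callable, List


def build_task_tokens(source_lang: str, target_lang: str, task: str, pnc: bool) -> List[str]:
    """Build an illustrative task prompt in the order listed above.

    The token spellings are assumptions; the real vocabulary may differ.
    """
    return [
        f"<{source_lang}>",
        f"<{target_lang}>",
        f"<{task}>",
        "<pnc>" if pnc else "<nopnc>",
    ]


def greedy_decode(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    eos: str = "<eos>",
    max_new_tokens: int = 50,
) -> List[str]:
    """Append one token at a time until the end token or the length limit.

    `next_token` stands in for the decoder: it sees everything generated so far
    (prompt plus output) and returns the next token.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == eos:
            break
        tokens.append(token)
    # Return only the generated text tokens, without the prompt.
    return tokens[len(prompt):]


if __name__ == "__main__":
    prompt = build_task_tokens("en", "de", "translate", pnc=True)
    canned = iter(["Guten", "Morgen", "<eos>"])  # stub decoder output
    print(prompt)
    print(greedy_decode(prompt, lambda toks: next(canned)))
```

In the real model the decoder also attends to the encoder's audio features at every step; the stub callback here hides that part entirely.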
Training: Trained on ~85,000 hours of labeled speech data from public and proprietary sources using NVIDIA NeMo on 128 A100 80GB GPUs.
Use cases: Transcription services, subtitle generation, voice assistants, real-time translation for meetings and video calls, and accessibility applications. At release, the model achieved a state-of-the-art average word error rate (WER) of ~6.5% on the Hugging Face Open ASR leaderboard.
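For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output (substitutions, deletions, and insertions), divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the leaderboard's evaluation code; the function name `word_error_rate` is hypothetical, and text normalization beyond whitespace splitting is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One missing word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 6.5% therefore corresponds to roughly 6 or 7 word-level errors per 100 reference words, averaged over the benchmark datasets.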