NVIDIA Canary-Qwen-2.5B is a state-of-the-art hybrid Speech-Augmented Language Model (SALM) that combines the Canary-1B-Flash encoder with a Qwen3-1.7B LLM decoder, achieving a record 5.63% WER on the Hugging Face Open ASR leaderboard.
Canary-Qwen-2.5B is an English speech-augmented language model that topped the Hugging Face Open ASR leaderboard at release with a record 5.63% WER while running at 418 RTFx. It operates in two modes: an ASR mode that transcribes speech to text, and an LLM mode in which the underlying decoder reasons over transcripts (e.g. summarization or question answering).
Architecture: A Speech-Augmented Language Model (SALM) combining two base models, the nvidia/canary-1b-flash FastConformer encoder and the Qwen/Qwen3-1.7B LLM decoder, connected via a linear projection, with LoRA adapters applied to the LLM. The encoder emits one output frame every 80 ms (12.5 tokens per second). The tokenizer is inherited from Qwen3-1.7B. LLM parameters were frozen during training; only the speech encoder, projection, and LoRA parameters were trainable.
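The coupling above can be sketched in a few lines; a minimal, illustrative NumPy example in which the dimensions (512 for the encoder, 2048 for the LLM), the prompt length, and the random weights are placeholders rather than the model's actual sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (placeholders, not the real model's sizes).
d_enc, d_llm = 512, 2048

# Encoder output: 40 s of audio at one frame per 80 ms -> 500 frames.
num_frames = int(40 / 0.080)
speech_feats = rng.standard_normal((num_frames, d_enc))

# The linear projection maps encoder frames into the LLM embedding space.
proj = rng.standard_normal((d_enc, d_llm)) * 0.02
speech_embeds = speech_feats @ proj

# Prompt token embeddings (e.g. a transcription instruction) would come
# from the frozen LLM's embedding table; 8 tokens here, purely invented.
prompt_embeds = rng.standard_normal((8, d_llm))

# Projected speech frames are spliced into the text token sequence and
# the combined sequence is fed to the (LoRA-adapted) LLM decoder.
llm_input = np.concatenate([prompt_embeds, speech_embeds], axis=0)
print(llm_input.shape)  # (508, 2048)
```

The projection is the only new bridge between the two pretrained models, which is why it is one of the few trainable components alongside the encoder and the LoRA adapters.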
Training: Trained using the NVIDIA NeMo toolkit for 90K steps on 32 NVIDIA A100 80GB GPUs. The training data comprises approximately 1.3B tokens, roughly 40M (speech, text) pairs drawn from 26 datasets. Maximum input audio length was 40 seconds per training sample, with a maximum sequence length of 1024 tokens.
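The figures above imply a simple per-sample token budget; a quick check, assuming one speech token per 80 ms encoder frame as stated in the architecture description:

```python
# Maximum audio length (s) and encoder frame duration (s), from the card.
max_audio_s = 40.0
frame_s = 0.080

speech_tokens = int(max_audio_s / frame_s)  # 500 speech tokens at most
max_seq_len = 1024

# Remaining budget for prompt and output text tokens in a sample.
text_budget = max_seq_len - speech_tokens
print(speech_tokens, text_budget)  # 500 524
```

So even a maximum-length 40-second clip consumes under half of the 1024-token sequence, leaving room for the instruction and transcript.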
Use cases: Meeting summarization, podcast/interview transcription and analysis, enterprise voice-to-text with downstream LLM reasoning, agentic speech systems, accessibility services. Released under CC-BY-4.0 for commercial use.