NVIDIA Canary-Qwen 2.5B is a state-of-the-art hybrid Speech-Augmented Language Model (SALM) combining the Canary-1B-Flash encoder with a Qwen3-1.7B LLM decoder, achieving a record 5.63% WER on the Hugging Face Open ASR leaderboard.
Canary-Qwen-2.5B is an English speech-augmented language model that topped the Hugging Face Open ASR leaderboard at release with a record 5.63% WER while running at 418 RTFx. It operates in two modes: an ASR mode that transcribes English speech with punctuation and capitalization, and an LLM mode that disables the speech adapter so the underlying Qwen LLM can reason over the transcript (summarization, question answering, and similar downstream tasks).
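A minimal usage sketch covering both modes, based on the NeMo SALM API shown on the model card; the audio path and prompts are illustrative, and exact signatures may vary across NeMo releases:

```python
# Requires NVIDIA NeMo with the speechlm2 collection; API follows the
# model card, but treat signatures as indicative rather than guaranteed.
from nemo.collections.speechlm2.models import SALM

model = SALM.from_pretrained("nvidia/canary-qwen-2.5b")

# ASR mode: transcribe an audio file (path is illustrative).
answer_ids = model.generate(
    prompts=[[{
        "role": "user",
        "content": f"Transcribe the following: {model.audio_locator_tag}",
        "audio": ["speech.wav"],
    }]],
    max_new_tokens=128,
)
transcript = model.tokenizer.ids_to_text(answer_ids[0].cpu())
print(transcript)

# LLM mode: temporarily disable the speech LoRA adapter and prompt the
# underlying Qwen3-1.7B directly on the transcript text.
with model.llm.disable_adapter():
    answer_ids = model.generate(
        prompts=[[{"role": "user",
                   "content": f"Summarize the following transcript:\n\n{transcript}"}]],
        max_new_tokens=256,
    )
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))
```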
Architecture: A Speech-Augmented Language Model (SALM) built from two base models, the nvidia/canary-1b-flash FastConformer encoder and the Qwen/Qwen3-1.7B LLM decoder, connected via a linear projection, with LoRA adapters applied to the LLM. The encoder emits one output frame every 80 ms (12.5 tokens per second). The tokenizer is inherited from Qwen3-1.7B. The LLM parameters were frozen during training; only the speech encoder, projection, and LoRA parameters were trainable.
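A conceptual sketch of that wiring; the class, module names, and dimensions below are illustrative assumptions, not the actual NeMo implementation:

```python
import torch
import torch.nn as nn

class SALMSketch(nn.Module):
    """Illustrative-only: encoder output frames are projected into the
    LLM embedding space and consumed alongside text embeddings."""

    def __init__(self, encoder: nn.Module, llm: nn.Module,
                 enc_dim: int = 1024, llm_dim: int = 2048):  # dims assumed
        super().__init__()
        self.encoder = encoder  # FastConformer speech encoder (trainable)
        self.llm = llm          # Qwen3-1.7B: base weights frozen, LoRA trainable
        self.proj = nn.Linear(enc_dim, llm_dim)  # trainable linear projection

    def forward(self, audio: torch.Tensor, prompt_embeds: torch.Tensor):
        # One encoder frame per 80 ms of audio, i.e. 12.5 frames/second.
        frames = self.encoder(audio)        # (B, T_audio, enc_dim)
        audio_embeds = self.proj(frames)    # (B, T_audio, llm_dim)
        # In the real model the audio embeddings replace a locator tag in
        # the prompt; simple concatenation stands in for that here.
        inputs = torch.cat([prompt_embeds, audio_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```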
Training: Trained using the NVIDIA NeMo toolkit for 90K steps on 32 NVIDIA A100 80GB GPUs. The training data comprised approximately 1.3B tokens, drawn from roughly 40M (speech, text) pairs across 26 datasets. Input audio was capped at 40 seconds per training sample, with a maximum sequence length of 1024 tokens.
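A back-of-envelope check on how those limits interact (my arithmetic, not a figure from the model card):

```python
# At 12.5 encoder frames per second, a maximum-length 40 s clip consumes
# 500 of the 1024 available sequence positions, leaving ~524 for text.
frames_per_second = 12.5   # one frame per 80 ms
max_audio_seconds = 40
audio_tokens = int(frames_per_second * max_audio_seconds)
print(audio_tokens, 1024 - audio_tokens)  # 500 524
```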
Use cases: Meeting summarization, podcast/interview transcription and analysis, enterprise voice-to-text with downstream LLM reasoning, agentic speech systems, and accessibility services. Released under the CC-BY-4.0 license, which permits commercial use.