
NVIDIA Canary-Qwen 2.5B is a state-of-the-art hybrid Speech-Augmented Language Model (SALM) combining the Canary-1B-Flash encoder with a Qwen3-1.7B LLM decoder, achieving a record 5.63% WER on the Hugging Face Open ASR leaderboard.
A solid 2.5B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
NVIDIA Canary-Qwen 2.5B is a hybrid Speech-Augmented Language Model (SALM) that sets a new benchmark for automatic speech recognition (ASR). Developed by NVIDIA and released under the permissive CC-BY-4.0 license, this 2.5-billion-parameter model achieves a record 5.63% word error rate (WER) on the Hugging Face Open ASR leaderboard—the lowest reported to date. It combines the Canary-1B-Flash speech encoder with a Qwen3-1.7B large language model (LLM) decoder, effectively merging transcription capabilities with downstream language understanding.
Unlike traditional ASR systems that only output raw text, Canary-Qwen 2.5B operates in two distinct modes: a pure transcription mode (ASR) and an LLM mode that retains the decoder’s reasoning, summarization, and question-answering skills. This makes it a single, locally-runable solution for workflows that require both accurate speech-to-text and post-processing of the transcript—without relying on cloud APIs. It competes with other 2-3B parameter ASR models, but its hybrid architecture and top-tier benchmark results place it in a class of its own for practitioners who need on-premise, real-time speech intelligence.
Canary-Qwen 2.5B is a dense Transformer model with 2.5 billion parameters. Its architecture is a dual-component pipeline:
The model is trained on a diverse corpus of English speech data, including LibriSpeech, Fisher, Switchboard, VoxPopuli, Common Voice, and several proprietary NVIDIA datasets (Granary, YTC, Yodas2). It supports automatic punctuation and capitalization (PnC) out of the box.
Key performance specs:
Because it is a dense model (not Mixture-of-Experts), all 2.5B parameters are active during every forward pass. This means VRAM requirements scale directly with model size and quantization level, with no sparsity tricks to reduce memory. Practitioners should plan for predictable memory usage based on standard 2.5B-parameter dense model calculations.
Canary-Qwen 2.5B is built for English speech recognition with optional LLM-driven post-processing. Its primary capabilities:
Concrete use cases:
Note: The model is English-only and text-output only. It does not generate speech or handle other languages.
This is where Canary-Qwen 2.5B shines: it can be run on a single consumer GPU with careful quantization. Here’s what you need to know to run it locally.
| Quantization | VRAM (approx) | Quality impact |
|---|---|---|
| FP16 (no quantization) | ~5.0 GB | Full precision |
| Q8_0 | ~3.0 GB | Negligible |
| Q4_K_M (recommended) | ~2.0 GB | Minimal WER increase (~0.5% absolute) |
| Q4_0 | ~1.8 GB | Slight degradation |
These estimates include inference overhead and assume a batch size of 1. For streaming or batch processing, add 1-2 GB for buffers.
Performance depends heavily on GPU and quantization. For an RTX 4090 at Q4_K_M:
On an M4 Max, expect similar speeds in ASR mode, with slightly lower LLM generation throughput due to metal backend differences.
Use [Ollama](https://ollama.ai) with the model’s GGUF conversion. As of mid-2026, community GGUF files for Canary-Qwen 2.5B are available. The command:
1ollama run nvidia-canary-qwen-2.5b
Alternatively, use NVIDIA NeMo’s inference scripts for full control over quantization and streaming.
vs. Whisper Large V3 (1.5B parameters)
vs. Parakeet-unified-en-0.6B (NVIDIA’s smaller model)
For anyone building a local, English-only ASR pipeline that also needs to analyze transcripts, NVIDIA Canary-Qwen 2.5B is currently the best open model at this size. Its combination of record-breaking WER, commercial-friendly license, and dual-mode architecture makes it a practical choice for developers who demand local inference without compromise.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every NVIDIA model we track.