
NVIDIA Canary 1B Flash is a fast 883M-parameter multilingual encoder-decoder ASR and translation model supporting 4 languages, with inference speed above 1000 RTFx.
A solid 0.883B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
NVIDIA Canary 1B Flash is a multilingual automatic speech recognition (ASR) and speech translation model built for fast local inference. At 883 million parameters (0.883B), it sits in the small‑to‑medium tier of speech models, but its throughput, an inverse real‑time factor (RTFx) above 1000, puts it ahead of many larger alternatives. Developed by NVIDIA, it is part of the NeMo framework and released under the CC‑BY‑4.0 license.
This model matters for practitioners who need on‑device speech processing without cloud dependencies. It supports four languages (English, German, French, Spanish) and handles both transcription and cross‑lingual translation in a single encoder‑decoder architecture. Unlike larger dense models that require high‑end hardware, Canary 1B Flash runs comfortably on consumer GPUs and even some CPU setups with appropriate quantization.
The model fills a specific niche: fast, accurate, multilingual ASR that doesn’t demand 8+ GB of VRAM. Competing with the likes of OpenAI Whisper (medium, 769M) and NVIDIA’s own Parakeet‑0.6B, Canary 1B Flash trades a slightly larger parameter count for significantly better efficiency—especially in streaming or real‑time scenarios where RTFx matters more than raw WER gains.
Canary 1B Flash uses a dense encoder‑decoder architecture built on FastConformer. The encoder has 32 layers, and the decoder is a 4‑layer Transformer. FastConformer is a variant of Conformer that reduces computational overhead while preserving the ability to model long audio sequences. The model employs a concatenated tokenizer for multilingual processing, combining subword units across English, German, French, and Spanish.
Despite being a dense model (no mixture of experts), the parameter count of 0.883B makes it memory‑efficient. In FP16 precision, the weights occupy ~1.8 GB. With activation memory and framework overhead, a typical inference session requires about 2–3 GB of VRAM. Quantization to 8‑bit (FP8 or INT8) cuts that to under 1 GB, enabling deployment on integrated GPUs and some NPUs.
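The weight-memory figures above follow directly from the parameter count. A minimal sketch of that arithmetic, assuming decimal gigabytes and counting weights only (activations and framework overhead are not modelled; the helper name `weight_gb` is illustrative):

```python
PARAMS = 883_000_000  # parameter count quoted above

# Bytes needed to store one weight in each format (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0}


def weight_gb(params: int, bytes_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for name, width in BYTES_PER_PARAM.items():
    print(f"{name}: {weight_gb(PARAMS, width):.2f} GB")
```

The FP16 row comes out just under 1.8 GB and the 8-bit row under 1 GB, which matches the estimates in the text.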
The model’s context length is not explicitly specified, but the NeMo framework’s default chunking handles long‑form audio automatically. Input can be raw audio (WAV, FLAC) sampled at 16 kHz. The output is text with optional punctuation, capitalisation, and word‑level timestamps.
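Because the model expects 16 kHz input, it can help to check files before sending them to inference. The sketch below uses only Python's standard `wave` module, so it covers WAV but not FLAC. The helper names and the generated silent test file are illustrative assumptions, not part of NeMo.

```python
import os
import tempfile
import wave

EXPECTED_RATE = 16_000  # sample rate stated above


def describe_wav(path: str) -> tuple[int, int, float]:
    """Return (sample_rate, channels, duration_seconds) for a WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def write_silence(path: str, seconds: float, rate: int = EXPECTED_RATE) -> None:
    """Write a mono 16-bit WAV file containing only silence."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


# Demo: create a half-second file and inspect it.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
write_silence(demo_path, seconds=0.5)
rate, channels, duration = describe_wav(demo_path)
print(rate == EXPECTED_RATE, channels, duration)
```

Files that fail the rate check would need resampling with an audio tool of your choice before transcription.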
Inference speed is benchmarked at >1000 RTFx on an NVIDIA A100 (likely with TensorRT optimisations). On a consumer RTX 4090, real‑world RTFx typically exceeds 500 even without extreme batching. This makes it suitable for real‑time transcription pipelines where latency is critical.
Canary 1B Flash’s primary capabilities are automatic speech recognition (ASR) and automatic speech translation (AST). It transcribes English, German, French, and Spanish, and translates from English into German, French, and Spanish as well as from each of those three languages into English. The model was trained on 85,000 hours of multilingual speech from sources like LibriSpeech, Common Voice, VoxPopuli, and Fisher.
Concrete use cases:
Benchmarks reported on the Hugging Face card show WER of 2.87% on LibriSpeech test‑other and 1.95% on SPGI Speech, and a BLEU score of 32.27 for En→De translation on FLEURS. These are competitive for a model of this size.
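For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain implementation with whitespace tokenisation and no text normalisation, so its numbers will not match leaderboard scoring exactly. The function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 2.87% therefore means roughly three word errors per hundred reference words.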
For most users, Q4_K_M is the sweet spot. It reduces memory footprint by ~75% while keeping WER degradation under 0.5%. If you need maximum speed on constrained hardware, try Q4_0 or Q2_K; the model is robust to aggressive quantization because of its dense architecture. FP8 needs native hardware support, which starts with Ada (SM 8.9) and Hopper (SM 9.0); on older GPUs, use INT8 for wider compatibility.
The simplest path is to download the model from Hugging Face and use NVIDIA NeMo’s inference scripts. There is no Ollama integration (this is a speech model, not an LLM). Instead, use:
    import nemo.collections.asr as nemo_asr
    model = nemo_asr.models.EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")
NeMo handles audio chunking, batching, and timestamp extraction. For production pipelines, export to TensorRT via the NeMo toolkit for optimal speed.
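For batch jobs, NeMo pipelines are commonly driven by a JSON-lines manifest with one object per audio file. The sketch below only builds such lines with the standard library and does not call NeMo. The field names (`audio_filepath`, `source_lang`, `target_lang`, `pnc`) and the convention of setting source and target to the same language for plain transcription are assumptions modelled on the manifest format the model card describes, so verify them against the card for your NeMo version.

```python
import json

SUPPORTED_LANGS = {"en", "de", "fr", "es"}  # the four languages listed above


def manifest_line(audio_path: str, source_lang: str, target_lang: str,
                  pnc: bool = True) -> str:
    """Build one JSON-lines manifest entry (field names are assumptions)."""
    for lang in (source_lang, target_lang):
        if lang not in SUPPORTED_LANGS:
            raise ValueError(f"unsupported language: {lang}")
    entry = {
        "audio_filepath": audio_path,
        "source_lang": source_lang,   # language spoken in the audio
        "target_lang": target_lang,   # same as source for plain ASR
        "pnc": "yes" if pnc else "no",  # punctuation and capitalisation
    }
    return json.dumps(entry)


# Hypothetical file names: one transcription job, one En->De translation job.
print(manifest_line("meeting.wav", "en", "en"))
print(manifest_line("meeting.wav", "en", "de"))
```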
| GPU | Precision | RTFx |
|---|---|---|
| RTX 4090 | FP16 – TensorRT | 1200+ |
| RTX 3060 (12 GB) | INT8 – ONNX | 300 |
| M4 Max (24‑core) | FP16 | 250 |
| Jetson Orin NX 16 GB | FP16 | 150 |
RTFx above 1 means you can process more than one second of audio per second of compute. At 1000 RTFx, a one‑hour recording transcribes in ~3.6 seconds.
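The same arithmetic applies to any RTFx value. A small sketch, using the throughput figures from the table above as inputs (the function name and dictionary are illustrative):

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Compute time needed to process `audio_seconds` of audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


ONE_HOUR = 3600.0

# RTFx values copied from the table above.
TABLE_RTFX = {
    "RTX 4090": 1200,
    "RTX 3060 (12 GB)": 300,
    "M4 Max (24-core)": 250,
    "Jetson Orin NX 16 GB": 150,
}

for device, rtfx in TABLE_RTFX.items():
    seconds = processing_seconds(ONE_HOUR, rtfx)
    print(f"{device}: {seconds:.1f} s per hour of audio")
```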
vs. OpenAI Whisper Medium (769M)
Whisper Medium supports 99 languages, but its encoder‑decoder is slower and requires around 3 GB at FP16. Canary 1B Flash achieves 2×–3× higher RTFx on the same GPU for the four supported languages. If you need broad language coverage, Whisper Medium is the better choice. If speed and low VRAM matter, Canary wins. Both licenses are permissive: Whisper uses MIT, while Canary’s CC‑BY‑4.0 requires attribution.
vs. NVIDIA Parakeet‑0.6B (English‑only)
Parakeet is slightly smaller (0.6B) and optimised for streaming with minimum 160 ms latency. Canary 1B Flash is larger but offers full multilingual capability and speech translation. For English‑only scenarios where latency is the top priority, Parakeet may edge ahead. For a single model that does English, German, French, and Spanish, Canary 1B Flash is more versatile.
vs. Wav2Vec2‑Large (300M)
Wav2Vec2 is smaller but only does ASR, not translation. Canary 1B Flash is faster and more feature‑rich. If you need an ultra‑lightweight model for English ASR on a Raspberry Pi, Wav2Vec2 might still be appropriate, but for any modern GPU, Canary 1B Flash provides better accuracy and functionality.
