
NVIDIA Canary 1B v2 is a scaled multilingual speech recognition and translation model supporting 25 European languages with state-of-the-art accuracy and 10x faster inference than comparable models.
A strong 0.978B-parameter dense audio model from NVIDIA. Treat the per-modality benchmarks as the leading indicator of fit; composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA L4 (Vast.ai · Spot · 24 GB VRAM) | $0.03 |
| NVIDIA L4 (Vast.ai · On-Demand · 24 GB VRAM) | $0.03 |
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
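To make the spot versus on-demand comparison concrete, here is a minimal sketch that prices a job against the rates in the table. The `spot_overhead` value (the fraction of work redone after interruptions) is a hypothetical planning number, not something measured on either provider, and the function names are illustrative.

```python
# Rates copied from the rental table above (USD per GPU-hour).
# Fields: GPU, provider, tier, VRAM in GB, rate.
RENTALS = [
    ("NVIDIA L4", "Vast.ai", "Spot", 24, 0.03),
    ("NVIDIA L4", "Vast.ai", "On-Demand", 24, 0.03),
    ("NVIDIA GeForce RTX 5070 Ti", "Vast.ai", "Spot", 16, 0.11),
    ("NVIDIA GeForce RTX 3070", "RunPod", "Community", 8, 0.13),
    ("NVIDIA GeForce RTX 3070", "RunPod", "Spot", 8, 0.13),
]


def job_cost(rate_per_hour, compute_hours, restart_overhead=0.0):
    """Cost of a job; restart_overhead is the assumed fraction of work redone."""
    return rate_per_hour * compute_hours * (1.0 + restart_overhead)


def cheapest(min_vram_gb, compute_hours, spot_overhead=0.2):
    """Return (gpu, provider, tier, cost) for the cheapest option, or None.

    spot_overhead is a hypothetical penalty applied only to the Spot tier.
    """
    best = None
    for gpu, provider, tier, vram_gb, rate in RENTALS:
        if vram_gb < min_vram_gb:
            continue
        overhead = spot_overhead if tier == "Spot" else 0.0
        cost = job_cost(rate, compute_hours, overhead)
        if best is None or cost < best[-1]:
            best = (gpu, provider, tier, cost)
    return best


if __name__ == "__main__":
    print(cheapest(min_vram_gb=4, compute_hours=10))
```

With equal sticker prices, any non-zero restart overhead tips the choice toward the on-demand listing, which is the point of the caveat about spot instances.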
NVIDIA Canary 1B v2 is a multilingual speech recognition and translation model designed for local deployment. It transcribes and translates speech from 25 European languages with high accuracy and inference speeds roughly 10× faster than comparable models like Whisper-large-v3. At 0.978 billion parameters (dense, not Mixture of Experts), it sits in the sweet spot between lightweight footprint and production-grade performance.
Developed by NVIDIA and released under the permissive CC-BY-4.0 license, Canary 1B v2 is built for practitioners who need to run automatic speech recognition (ASR) and speech-to-text translation (AST) on their own hardware — no cloud APIs required. It competes directly with models like Whisper-large-v3 (1.5B parameters) and Seamless-M4T-v2-large, but with significantly lower compute demands.
This model is a standalone speech engine, not a multimodal LLM. It takes audio input and outputs text — either a transcript in the original language or an English translation. For developers building offline voice assistants, meeting transcription tools, or multilingual content pipelines, Canary 1B v2 offers a practical, high-performance option that fits on consumer GPUs.
Canary 1B v2 uses a FastConformer encoder paired with a Transformer decoder — a proven combination for speech tasks. The dense architecture means all 0.978B parameters are active during every forward pass, giving consistent latency and predictable VRAM usage. There is no context length specification, but the model processes streaming audio (chunked) and is optimized for real-time or near-real-time inference.
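Chunked processing can be sketched as splitting a long recording into overlapping windows before each window is passed to the model. The sample rate, chunk length, and overlap below are illustrative assumptions rather than documented Canary settings, and `chunk_bounds` is a hypothetical helper, not part of any NVIDIA library.

```python
def chunk_bounds(num_samples, sample_rate=16000, chunk_s=30.0, overlap_s=2.0):
    """Return (start, end) sample offsets for overlapping audio chunks.

    All defaults are assumptions chosen for illustration.
    """
    if num_samples <= 0:
        return []
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += step  # advance by the non-overlapping part of the window
    return bounds


if __name__ == "__main__":
    # A 70-second recording at 16 kHz.
    for start, end in chunk_bounds(70 * 16000):
        print(start / 16000, end / 16000)
```

The overlap gives each window some shared context at its edges, so words that straddle a boundary can be reconciled when the per-chunk transcripts are merged.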
Key architectural traits:
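For local inference, the model's dense nature means no expert routing overhead: the weights occupy a fixed amount of memory, and compute and activation memory grow predictably with the length of the audio chunk rather than varying with routing decisions.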
Canary 1B v2 is purpose-built for two tasks:
Supported languages: bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv, ru, uk. English is included as both source and target.
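A small guard like the following can reject unsupported language codes before any audio reaches the model. The code list is taken from the paragraph above; the rule that a translation pair must include English is an assumption made for this sketch, and `pick_task` is a hypothetical helper.

```python
# Language codes as listed above (25 in total).
SUPPORTED = frozenset(
    "bg hr cs da nl en et fi fr de el hu it lv lt mt pl "
    "pt ro sk sl es sv ru uk".split()
)


def pick_task(source_lang, target_lang):
    """Return "asr" for same-language transcription, "ast" for translation."""
    for code in (source_lang, target_lang):
        if code not in SUPPORTED:
            raise ValueError(f"unsupported language code: {code}")
    if source_lang == target_lang:
        return "asr"
    # Assumption for this sketch: translation pairs have English on one side.
    if "en" in (source_lang, target_lang):
        return "ast"
    raise ValueError("translation pairs are assumed to include English")


if __name__ == "__main__":
    print(pick_task("de", "de"))
    print(pick_task("de", "en"))
```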
Concrete use cases:
Benchmark results on the FLEURS test set (a standard multilingual speech benchmark) show Canary 1B v2 achieving a Word Error Rate (WER) of 4.5% on English (comparable to Whisper-large-v3) and between 4% and 12% on other European languages. For AST, BLEU scores range from 24 to 36 depending on language pair, and COMET scores (semantic translation quality) fall in the 76–83 range, competitive with models 2–5× larger.
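For readers who want to check WER on their own recordings, the metric is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain implementation of that definition; it skips the text normalization (casing, punctuation) that published benchmarks usually apply, so its numbers will not match leaderboard figures exactly.

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```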
This is not a general-purpose LLM or text generator. It is a focused speech engine that does one thing (transcribe/translate speech) and does it well with minimal hardware overhead.
Because it has fewer than 1 billion parameters, Canary 1B v2 is one of the most accessible high-quality speech models for local inference. Below are hardware requirements and performance expectations.
| Quantization | VRAM (approx.) | Notes |
|---|---|---|
| FP16 | ~2 GB | Half precision, best accuracy, fits most GPUs |
| INT8 (8-bit) | ~1 GB | Minor accuracy loss, common trade-off |
| Q4_K_M | ~0.6 GB | Good balance for most users |
| Q3_K_L | ~0.5 GB | Heavier quantization, suitable for edge devices |
A minimum of 4 GB of total VRAM is recommended, which leaves room for audio buffering and runtime overhead. Feature extraction for audio processing typically uses an additional 200–400 MB.
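The VRAM figures in the table follow from parameter count times bits per weight. The sketch below reproduces that arithmetic. The bits-per-weight values for Q4_K_M and Q3_K_L are rough assumptions borrowed from typical GGUF quantization figures, not measurements of this model, and the result covers weights only, without feature-extraction or runtime overhead.

```python
PARAMS = 0.978e9  # parameter count stated for Canary 1B v2

# Effective bits per weight. The two k-quant values are approximate
# assumptions, not measured for this model.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "INT8": 8.0,
    "Q4_K_M": 4.85,
    "Q3_K_L": 4.27,
}


def weight_gb(params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_gb(PARAMS, bits):.2f} GB")
```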
All measurements assume a single stream (no batching) on an RTX 4090:
Note: Tokens per second refers to output text tokens, not audio length. For ASR, 1 second of typical speech produces roughly 10–20 text tokens; 200 t/s means you can transcribe 10–20 seconds of speech per second of compute time.
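The conversion in the note can be written out as a short calculation. It uses only the figures quoted above (200 output tokens per second and 10 to 20 text tokens per second of speech); the function names are illustrative.

```python
def realtime_factor(tokens_per_s, tokens_per_audio_s):
    """Seconds of speech transcribed per second of compute."""
    return tokens_per_s / tokens_per_audio_s


def transcribe_time_s(audio_s, tokens_per_s, tokens_per_audio_s):
    """Estimated compute time, in seconds, for a recording of length audio_s."""
    return audio_s / realtime_factor(tokens_per_s, tokens_per_audio_s)


if __name__ == "__main__":
    # Dense speech (20 tokens/s) is the slow case, sparse speech (10) the fast one.
    print(realtime_factor(200, 20), realtime_factor(200, 10))
    # One hour of audio at the slow end of the range.
    print(transcribe_time_s(3600, 200, 20))
```

At 200 t/s, a one-hour recording would take roughly 3 to 6 minutes of compute under these figures.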
The easiest local path is through Ollama. Canary 1B v2 is available in the Ollama model library (confirm availability with `ollama pull nvidia-canary-1b-v2`). The model uses NeMo's inference engine under the hood, so you need the NeMo runtime installed (or rely on Ollama's bundled container). For direct PyTorch usage, see the Hugging Face model card.
When to choose Canary: You work primarily with European languages, need maximum throughput, and want to run on consumer GPUs.
Parakeet-TDT-0.6B-v3 is NVIDIA’s smaller sibling, covering the same 25 languages but with only 600M parameters. Canary 1B v2 has ~60% more parameters and achieves lower WER on most languages (by 1–3 points). Parakeet is the choice when VRAM is extremely tight (e.g., edge devices with 2–3 GB). For most desktop users, Canary is the better pick for accuracy-critical work.
Tradeoff summary: Canary 1B v2 delivers Whisper-large-v3-tier accuracy at much lower compute cost, but only for European languages. If your workflow centers on European languages, it is one of the most efficient local speech models available today.
