NVIDIA

NVIDIA Canary 1B

NVIDIA NeMo Canary 1B is a 1-billion-parameter multilingual encoder-decoder ASR and speech translation model supporting English, German, French, and Spanish.

1B params · Dense


Our Take

Best for: Open-weight ASR workloads

A solid 1B-parameter dense audio model from NVIDIA. Treat the benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 1B

Architecture: Dense

Provider: NVIDIA

Download Size: 8.4 GB

Community

Monthly Downloads: 1.8K

Likes: 457

Last Updated: 6 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

CC-BY-NC-4.0

Performance & Scoring

Benchmarks

WER: 6.5%

MBA Open Score: 67.9 (BB)

Benchmark (40%): 87.0

Popularity (25%): 47.8

Efficiency (25%): 56.5

Versatility (10%): 70.0
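The four component scores and their weights reproduce the headline score if they are combined as a plain weighted sum. That combination rule is an assumption here (the page does not state its formula), and the names `COMPONENTS` and `composite_score` are illustrative only.

```python
# Component scores and weights copied from the list above.
COMPONENTS = {
    "benchmark": (87.0, 0.40),
    "popularity": (47.8, 0.25),
    "efficiency": (56.5, 0.25),
    "versatility": (70.0, 0.10),
}


def composite_score(components):
    """Weighted sum of (score, weight) pairs; assumes the weights sum to 1."""
    return sum(score * weight for score, weight in components.values())


score = composite_score(COMPONENTS)
print(round(score, 1))
```

Under that assumption the sum is 34.8 + 11.95 + 14.125 + 7.0 = 67.875, which rounds to the 67.9 shown.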

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices (first 25 shown)

Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.1 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.1 GB
AMD Instinct MI300X	AMD	SS	1.1 GB
AMD Instinct MI325X	AMD	SS	1.1 GB
AMD Instinct MI355X	AMD	SS	1.1 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.1 GB
AMD Radeon RX 7700 XT	AMD	SS	1.1 GB
AMD Radeon RX 7800 XT	AMD	SS	1.1 GB
AMD Radeon RX 7900 XT	AMD	SS	1.1 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.1 GB
AMD Radeon RX 9070	AMD	SS	1.1 GB
AMD Radeon RX 9070 XT	AMD	SS	1.1 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.1 GB
Apple M4	Apple	SS	1.1 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.1 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.1 GB
Apple M5	Apple	SS	1.1 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.1 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.1 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.1 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.1 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.1 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.1 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.1 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.1 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.11
NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM)	$0.13
NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM)	$0.13
NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM)	$0.13
NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM)	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.


About This Model

Overview

NVIDIA Canary 1B is a 1-billion-parameter encoder-decoder model purpose-built for automatic speech recognition (ASR) and speech-to-text translation (AST). Developed by NVIDIA’s NeMo team, it transcribes and translates speech across four languages: English, German, French, and Spanish. It’s a dense transformer model, not mixture-of-experts, so all 1B parameters are active per inference — a predictable trade-off that makes VRAM requirements straightforward to calculate.

This model sits in the small-to-medium ASR tier, competing directly with other open-weight speech models like Whisper small (244M) or Whisper medium (769M). But Canary 1B is not just larger; it’s designed with a task-prompt mechanism that lets you specify source language, target language, and punctuation style in a single pass. That flexibility, combined with strong benchmark results, makes it a practical choice for developers who need local speech pipelines without cloud dependencies.

Architecture & Technical Details

Canary 1B pairs a FastConformer encoder with a Transformer decoder. The encoder turns the input audio into acoustic feature representations, and the decoder generates text tokens conditioned on a task-specific prompt. That prompt is a short sequence of special tokens placed at the start of the decoder input, which gives a clean, controllable interface for switching between transcription and translation tasks.
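The task prompt is driven by a few per-utterance settings: source language, target language, and whether to emit punctuation and capitalization. The sketch below builds one JSON-lines manifest entry carrying those settings. The field names (`audio_filepath`, `taskname`, `source_lang`, `target_lang`, `pnc`) and the task labels are assumptions modelled on NeMo-style manifests, and `manifest_entry` is a hypothetical helper; check the NeMo documentation before relying on them.

```python
import json

# Languages covered by Canary 1B, per the description above.
SUPPORTED = {"en", "de", "fr", "es"}


def manifest_entry(audio_path, duration_s, source_lang, target_lang, pnc=True):
    """Return one JSON line describing an ASR or translation request.

    Field names are assumptions modelled on NeMo-style manifests.
    """
    if source_lang not in SUPPORTED or target_lang not in SUPPORTED:
        raise ValueError("Canary 1B covers en, de, fr and es only")
    if source_lang != target_lang and "en" not in (source_lang, target_lang):
        raise ValueError("translation must have English on one side")
    entry = {
        "audio_filepath": audio_path,
        "duration": duration_s,
        # Same language in and out means transcription, otherwise translation.
        "taskname": "asr" if source_lang == target_lang else "s2t_translation",
        "source_lang": source_lang,
        "target_lang": target_lang,
        "pnc": "yes" if pnc else "no",
    }
    return json.dumps(entry)


print(manifest_entry("call_es.wav", 12.5, "es", "en"))
```

Keeping the task choice in data rather than in code means the same model call can serve both transcription and translation requests.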

The architecture is dense with 1B parameters: no sparsity, no routing overhead. Weight memory therefore scales linearly with numeric precision, while activation memory grows with batch size and audio length. For a single stream, expect roughly 2 GB for the weights at FP16 (1B parameters × 2 bytes), plus overhead for the runtime and audio processing. For batch inference or higher throughput, plan for 4-8 GB depending on sequence length.
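The weight-memory figure is simple arithmetic: parameter count times bytes per parameter. A minimal sketch, where `weight_memory_gb` is an illustrative helper and the result covers weights only (no activations, runtime, or audio buffers):

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0}


def weight_memory_gb(num_params, precision):
    """Memory for model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


for precision in ("fp32", "fp16", "int8"):
    print(precision, weight_memory_gb(1e9, precision), "GB")
```

For a dense model every parameter is loaded, so this number is a hard floor on memory use at a given precision.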

NVIDIA does not specify a context length, but encoder-decoder ASR models process audio segments rather than long-form text, so expect to slice long recordings into segments at inference time. The model expects 16 kHz mono audio, which is standard for speech.
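Segment slicing can be sketched as computing sample-index boundaries over a long recording. The 30-second default segment length and the optional overlap are assumptions chosen for illustration, not values published by NVIDIA, and `segment_bounds` is a hypothetical helper.

```python
def segment_bounds(num_samples, sample_rate=16_000, segment_s=30.0, overlap_s=0.0):
    """Return (start, end) sample indices that cover the audio in fixed-length segments."""
    seg = int(segment_s * sample_rate)
    step = seg - int(overlap_s * sample_rate)
    if seg <= 0 or step <= 0:
        raise ValueError("segment length must be positive and larger than the overlap")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + seg, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the final (possibly shorter) segment has been emitted
        start += step
    return bounds


# A 70-second recording at 16 kHz yields two full segments and one 10-second tail.
print(segment_bounds(70 * 16_000))
```

A small overlap between neighbouring segments reduces the chance of cutting a word in half, at the cost of having to merge duplicated text afterwards.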

Training data includes LibriSpeech, Fisher, Switchboard, Common Voice, VoxPopuli, EuroParl, and others. The model was trained within the NeMo framework, which also provides the recommended inference pipeline. The license is CC-BY-NC-4.0 — not for commercial use without additional NVIDIA licensing (the NIM container has separate terms).

Capabilities & Use Cases

Canary 1B handles two primary tasks:

Automatic Speech Recognition (ASR): Transcribe audio in English, German, French, or Spanish. Word error rates (WER) on standard benchmarks are competitive — 2.89% on LibriSpeech other, 4.61% on Common Voice German, 3.99% on Spanish, 6.53% on French. Punctuation and capitalization are supported out of the box.

Automatic Speech Translation (AST): Bidirectional translation between English and each of the other three languages. For example, Spanish audio to English text, or English audio to German text. BLEU scores on FLEURS: En→De 32.15, De→En 33.98, En→Es 22.66, En→Fr 40.76.
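Word error rate is the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. A minimal sketch of that calculation follows; it does no text normalization (casing, punctuation), which published benchmark figures usually apply first, and the example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


print(wer("guten morgen zusammen", "guten tag zusammen"))
```

One substituted word out of three gives a WER of about 33%, so a 4.61% WER corresponds to fewer than one error in every 20 words.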

Concrete use cases:

Local transcription for meeting recordings: run on a laptop with an RTX 4060 or M4 Max, no cloud egress.
Multilingual customer support: transcribe and translate agent-customer calls in real time or in post-call analysis.
Subtitling and localization pipelines: batch audio files through transcription and translation in one model, with the source language set per file.
Accessibility tools: real-time captioning for live events with multilingual audiences.

Because it’s a 1B model, it runs comfortably on consumer GPUs (details below) and can even be deployed on Apple Silicon at usable speeds.

Running NVIDIA Canary 1B Locally

This is where Canary 1B shines for practitioners. You can run it on a single consumer GPU or on an Apple Silicon machine with unified memory. Here’s what you need to know.

VRAM Requirements

Quantization	Min VRAM (approx.)	Recommended VRAM
FP16 (default)	2 GB	4 GB
Q8_0	1.5 GB	3 GB
Q4_K_M	1 GB	2 GB

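These numbers are for single-stream inference. With batching or long audio segments, add 20-50%. The model is small enough that FP16 on a 6 GB GPU is perfectly viable. Q8_0 and Q4_K_M are GGUF-style quantization formats; NVIDIA distributes Canary 1B as a NeMo checkpoint, so those rows assume a community conversion.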
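The table and the 20-50% batching allowance can be combined into a quick fit check. This sketch takes the approximate minimums from the table above, pads them by a margin, and returns the highest-precision format that still fits; `best_fit` and its default 50% margin are illustrative choices.

```python
# Approximate minimum VRAM in GB, copied from the table above.
MIN_VRAM_GB = {"FP16": 2.0, "Q8_0": 1.5, "Q4_K_M": 1.0}


def best_fit(gpu_vram_gb, batching_margin=0.5):
    """Return the highest-precision format whose padded requirement fits, or None."""
    for name in ("FP16", "Q8_0", "Q4_K_M"):  # ordered from highest to lowest precision
        if MIN_VRAM_GB[name] * (1 + batching_margin) <= gpu_vram_gb:
            return name
    return None


for vram in (1.0, 2.0, 6.0):
    print(vram, "GB ->", best_fit(vram))
```

With the 50% margin, a 6 GB card clears the padded FP16 requirement of 3 GB with room to spare, which matches the claim above.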

Consumer Hardware

RTX 4090 (24 GB): Overkill. You can run multiple concurrent streams or batch inference. Expect transcription well faster than real time.
RTX 4060 (8 GB): Excellent. Q8_0 or FP16, single stream. Real-time or faster.
RTX 3060 (12 GB): Comfortable at FP16, even with moderate batching.
M4 Max (64 GB unified): Ample memory for FP16. Inference speed depends on the backend and on CPU/GPU balance, and assumes a working MLX or llama.cpp port is available.
Apple M1/M2 with 16 GB: Viable at Q4_K_M where a llama.cpp or whisper.cpp-style implementation is available.

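Performance (Real-Time Factor)

ASR models are usually benchmarked by real-time factor (RTF) rather than tokens per second. RTF is processing time divided by audio duration, so lower is better and anything below 1.0 is faster than real time. For a 30-second audio clip: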

RTX 4090 (FP16): RTF ~0.05 (processes 30s audio in ~1.5s)
RTX 4060 (Q4_K_M): RTF ~0.1 (30s in ~3s)
M4 Max (FP16): RTF ~0.08 (30s in ~2.4s)

These are estimates; actual performance varies by audio length, language, and implementation. For real-time streaming, most consumer GPUs can keep up with live microphone input.
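The RTF figures above convert to processing time by multiplying by the clip length. A minimal sketch using the estimated RTF values from the list; the function names are illustrative.

```python
def real_time_factor(processing_s, audio_s):
    """RTF = time spent processing / duration of the audio processed."""
    return processing_s / audio_s


def processing_time_s(audio_s, rtf):
    """Expected processing time for a clip of the given duration."""
    return audio_s * rtf


def keeps_up_with_live_audio(rtf):
    """Streaming is only sustainable when audio is processed faster than it arrives."""
    return rtf < 1.0


# Estimated RTF values from the list above, applied to a 30-second clip.
for device, rtf in [("RTX 4090", 0.05), ("RTX 4060", 0.1), ("M4 Max", 0.08)]:
    print(device, round(processing_time_s(30.0, rtf), 2), "s", keeps_up_with_live_audio(rtf))
```

The same arithmetic scales to batch jobs: at an RTF of 0.05, one hour of audio takes about three minutes of processing.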

Best Quantization for NVIDIA Canary 1B

Q4_K_M is the sweet spot for most users: per the estimates above it roughly halves VRAM relative to FP16, typically with little WER increase. Q8_0 provides near-lossless quality but uses more memory. FP16 is only necessary if you need maximum accuracy on noisy audio or specialized benchmarks.

How to Run

The most direct route is NVIDIA’s NeMo toolkit or NeMo inference container. In Python, load the model from Hugging Face with nemo.collections.asr.models.EncDecMultiTaskModel (Canary is an attention-based encoder-decoder model, not an RNNT model). A simple inference script:

from nemo.collections.asr.models import EncDecMultiTaskModel

# Download and load the pretrained checkpoint (requires the NeMo toolkit).
model = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b")
# Transcribe a 16 kHz mono WAV file.
transcription = model.transcribe(["audio.wav"])

Ollama’s llama.cpp backend doesn’t natively support encoder-decoder ASR yet, and whisper.cpp would need to be adapted; check the latest community builds for Canary support.

Hardware Requirements Summary

Minimum: 4 GB GPU VRAM (leaves headroom above the 1-2 GB Q4_K_M estimate), 8 GB system RAM, 16 kHz audio input.
Recommended: 8 GB GPU (RTX 4060 or better), 16 GB RAM, SSD for model storage (~1.7 GB for Q4_K_M).
Apple Silicon: 16 GB unified memory minimum; 24 GB+ recommended for batch or streaming.

How It Compares

Canary 1B competes directly with OpenAI Whisper medium (769M) and small (244M) on ASR, and with SeamlessM4T (2.3B) on translation, though SeamlessM4T is larger and covers more languages.

Model	Parameters	Languages	ASR WER (LibriSpeech other)	Translation	License
Canary 1B	1B	en, de, fr, es	2.89%	Bidirectional en↔de,fr,es	CC-BY-NC-4.0
Whisper medium	769M	99	4.5% (approx)	X→en only	MIT
Whisper small	244M	99	6.2% (approx)	X→en only	MIT
SeamlessM4T	2.3B	100+	N/A	100+ languages	CC-BY-NC-4.0

When to choose Canary 1B: You need high-accuracy ASR for English, German, French, or Spanish, and you want built-in translation between English and those languages in a single model. It is somewhat larger than Whisper medium (1B vs. 769M parameters), but it delivers better WER on its supported languages.

When to choose Whisper: You need coverage for 99 languages (even with lower accuracy on lesser-resourced languages). Whisper models are also MIT-licensed, so they’re permissive for commercial use.

When to choose SeamlessM4T: You need translation across many more language pairs — but you’ll pay in VRAM (2.3B params) and latency.

Canary 1B’s trade-off is clear: focused, high-quality ASR and translation for four languages, under a license that restricts commercial use. If that fits your project, it’s a strong local inference choice.

