NVIDIA

NVIDIA Canary 1B v2

NVIDIA Canary 1B v2 is a scaled multilingual speech recognition and translation model supporting 25 European languages with state-of-the-art accuracy and 10x faster inference than comparable models.

0.978B params · Dense

Our Take

Best for: Open-source ASR workloads

A strong 0.978B-parameter dense audio model from NVIDIA. Treat the modality benchmarks on this page as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 0.978B

Architecture: Dense

Provider: NVIDIA

Download Size: 12.7 GB

Community

Monthly Downloads: 104.5K

Likes: 397

Last Updated: 6 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER: 7.2%

MBA Open Score: 70.7 (AA)

Benchmark (40%): 85.7

Popularity (25%): 63.5

Efficiency (25%): 54.3

Versatility (10%): 70.0
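The four weights and sub-scores listed above are consistent with a plain weighted sum. The sketch below assumes that is how the composite is formed (the page does not state the formula), and the names `COMPONENTS` and `composite_score` are illustrative only.

```python
# Weights and sub-scores are copied from the listing above.
# Assumption: the composite is a simple weighted sum of the sub-scores.
COMPONENTS = {
    "benchmark": (0.40, 85.7),
    "popularity": (0.25, 63.5),
    "efficiency": (0.25, 54.3),
    "versatility": (0.10, 70.0),
}


def composite_score(components):
    """Return the weighted sum of (weight, score) pairs."""
    return sum(weight * score for weight, score in components.values())


if __name__ == "__main__":
    # Rounded to one decimal to match the precision shown on the page.
    print(round(composite_score(COMPONENTS), 1))
```

Under this assumption the rounded result agrees with the 70.7 shown for the MBA Open Score.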

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.1 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.1 GB
AMD Instinct MI300X	AMD	SS	1.1 GB
AMD Instinct MI325X	AMD	SS	1.1 GB
AMD Instinct MI355X	AMD	SS	1.1 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.1 GB
AMD Radeon RX 7700 XT	AMD	SS	1.1 GB
AMD Radeon RX 7800 XT	AMD	SS	1.1 GB
AMD Radeon RX 7900 XT	AMD	SS	1.1 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.1 GB
AMD Radeon RX 9070	AMD	SS	1.1 GB
AMD Radeon RX 9070 XT	AMD	SS	1.1 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.1 GB
Apple M4	Apple	SS	1.1 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.1 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.1 GB
Apple M5	Apple	SS	1.1 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.1 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.1 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.1 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.1 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.1 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.1 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.1 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.1 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4 (Vast.ai · Spot · 24 GB VRAM)	$0.03
NVIDIA L4 (Vast.ai · On-Demand · 24 GB VRAM)	$0.03
NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.11
NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM)	$0.13
NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM)	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
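To turn a GPU-hour rate into a cost per hour of transcribed audio, divide the rate by how many seconds of audio the GPU processes per second of compute. The sketch below uses the rates from the table; the function name and the 10x speedup are placeholders for illustration, not measurements for these GPUs.

```python
def cost_per_audio_hour(gpu_hour_usd, speedup):
    """USD needed to transcribe one hour of audio.

    gpu_hour_usd: rental price per GPU-hour.
    speedup: seconds of audio processed per second of compute
             (a hypothetical input; measure it on your own hardware).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return gpu_hour_usd / speedup


if __name__ == "__main__":
    # Rates come from the table above; 10.0 is an assumed speedup.
    rates = [("L4 spot", 0.03), ("RTX 5070 Ti spot", 0.11), ("RTX 3070", 0.13)]
    for name, rate in rates:
        print(f"{name}: ${cost_per_audio_hour(rate, 10.0):.4f} per audio hour")
```

For spot instances, interruptions add restart time, so the effective speedup is lower than the steady-state figure.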

About This Model

Overview

NVIDIA Canary 1B v2 is a multilingual speech recognition and translation model designed for local deployment. It transcribes and translates speech from 25 European languages with high accuracy and inference speeds roughly 10× faster than comparable models like Whisper-large-v3. At 0.978 billion parameters (dense, not Mixture of Experts), it sits in the sweet spot between lightweight footprint and production-grade performance.

Developed by NVIDIA and released under the permissive CC-BY-4.0 license, Canary 1B v2 is built for practitioners who need to run automatic speech recognition (ASR) and speech-to-text translation (AST) on their own hardware — no cloud APIs required. It competes directly with models like Whisper-large-v3 (1.5B parameters) and Seamless-M4T-v2-large, but with significantly lower compute demands.

This model is a standalone speech engine, not a multimodal LLM. It takes audio input and outputs text — either a transcript in the original language or an English translation. For developers building offline voice assistants, meeting transcription tools, or multilingual content pipelines, Canary 1B v2 offers a practical, high-performance option that fits on consumer GPUs.

Architecture & Technical Details

Canary 1B v2 pairs a FastConformer encoder with a Transformer decoder, a proven combination for speech tasks. The dense architecture means all 0.978B parameters are active during every forward pass, which gives consistent latency and predictable VRAM usage for a given input length. No context length is specified; long or streaming audio is processed in chunks, and the model is optimized for real-time or near-real-time inference.

Key architectural traits:

FastConformer: Efficient convolutional encoder with self-attention, designed for high throughput on speech signals. It supports variable-length input without padding overhead.
Transformer decoder: Generates text tokens (transcripts or translations) with standard autoregressive decoding.
Training data: 1.7 million hours of multilingual speech, including NVIDIA's Granary dataset and NeMo ASR Set 3.0, with non-speech audio mixed in to reduce hallucination.
Two-stage training: Pre-training on large corpora followed by fine-tuning with dynamic data balancing across languages.
Timestamps: Uses the NeMo Forced Aligner (auxiliary CTC head) to produce reliable segment-level timestamps without additional downstream tools.

For local inference, the dense design means no expert routing overhead: memory and compute depend on audio length and batch size, not on which experts a given input activates.

Capabilities & Use Cases

Canary 1B v2 is purpose-built for two tasks:

Automatic Speech Recognition (ASR): Transcribe speech in any of 25 European languages into text in that same language.
Speech-to-Text Translation (AST): Transcribe speech in any of those 25 languages and simultaneously translate the output into English.

Supported languages: bg, hr, cs, da, nl, en, et, fi, fr, de, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, es, sv, ru, uk. English is included as both source and target.
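The language list and the two tasks above can be captured in a small validation helper. This is a sketch only: `SUPPORTED` and `task_for` are hypothetical names, and it assumes every translation pair has English on one side, which fits the description of English as both a source and a target.

```python
# The 25 language codes listed above.
SUPPORTED = frozenset(
    "bg hr cs da nl en et fi fr de el hu it lv lt mt pl pt ro sk sl es sv ru uk".split()
)


def task_for(source_lang, target_lang):
    """Classify a language pair as 'asr' or 'ast', or raise ValueError."""
    for code in (source_lang, target_lang):
        if code not in SUPPORTED:
            raise ValueError(f"unsupported language code: {code}")
    if source_lang == target_lang:
        return "asr"  # transcript in the spoken language
    if "en" in (source_lang, target_lang):
        return "ast"  # translation with English on one side
    # Assumption: pairs without English are not supported.
    raise ValueError("translation pairs are assumed to include English")


if __name__ == "__main__":
    print(task_for("de", "de"), task_for("fr", "en"))
```

Validating the pair before loading audio avoids spending compute on a request the model cannot serve.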

Concrete use cases:

On-device voice assistants that need low-latency, offline multilingual support.
Meeting transcription and translation for European business contexts — real-time or batch processing.
Media localization — generate English subtitles from source-language audio in 25 languages without cloud costs.
Accessibility tools — closed captioning and live translation for educational or public broadcasts.

Benchmark results on the FLEURS test set (a standard multilingual speech benchmark) show Canary 1B v2 achieving a Word Error Rate (WER) of 4.5% on English (comparable to Whisper-large-v3) and between 4% and 12% on other European languages. For AST, BLEU scores range from 24 to 36 depending on the language pair, and COMET scores (semantic translation quality) fall in the 76–83 range, competitive with models 2–5× larger.
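For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration with a hypothetical function name and naive whitespace tokenization; benchmark pipelines usually also normalize casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 4.5% therefore corresponds to roughly one word-level error per 22 reference words.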

This is not a general-purpose LLM or text generator. It is a focused speech engine that does one thing (transcribe/translate speech) and does it well with minimal hardware overhead.

Running NVIDIA Canary 1B v2 Locally

Because it has fewer than 1 billion parameters, Canary 1B v2 is one of the most accessible high-quality speech models for local inference. Below are hardware requirements and performance expectations.

VRAM Requirements

Quantization	VRAM (approx.)	Notes
FP16	~2 GB	Half precision, best accuracy of the options listed, fits most GPUs
INT8 (8-bit)	~1 GB	Minor accuracy loss, common trade-off
Q4_K_M	~0.6 GB	Good balance for most users
Q3_K_L	~0.5 GB	Heavier quantization, suitable for edge devices

A minimum of 4 GB of VRAM is recommended to cover audio buffering and runtime overhead on top of the weights. Feature extraction for audio processing typically uses an additional 200–400 MB.
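The FP16 and INT8 rows follow directly from parameter count times bytes per weight. The sketch below reproduces that arithmetic and, for the two K-quant rows, works backwards from the table to the bits per weight they imply. It covers weights only, uses decimal gigabytes, and the function names are illustrative.

```python
PARAMS = 0.978e9  # parameter count quoted on this page


def weights_gb(params, bits_per_weight):
    """Size of the weights alone, in decimal GB."""
    return params * bits_per_weight / 8 / 1e9


def implied_bits(params, table_gb):
    """Average bits per weight implied by a size taken from the table."""
    return table_gb * 1e9 * 8 / params


if __name__ == "__main__":
    print("FP16:", weights_gb(PARAMS, 16), "GB")
    print("INT8:", weights_gb(PARAMS, 8), "GB")
    # Sizes below are the approximate values from the table above.
    for name, gb in [("Q4_K_M", 0.6), ("Q3_K_L", 0.5)]:
        print(name, "implies about", round(implied_bits(PARAMS, gb), 1), "bits/weight")
```

Activations, audio features, and runtime buffers come on top of these figures, which is why the recommended minimum is well above the weight size.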

Consumer Hardware That Works

NVIDIA RTX 3060 (12GB): Handles FP16 with ease, can batch small files or run streaming.
NVIDIA RTX 4090 (24GB): Overkill, but provides headroom for multiple concurrent streams or longer audio buffers.
Apple M4 Max (64GB unified memory): Runs FP16 effortlessly; memory bandwidth is high enough for real-time inference.
NVIDIA RTX 4060 (8GB): Runs Q4_K_M or Q8_0 comfortably; expect 50–100 tokens per second.
Raspberry Pi 5 (8GB): Not recommended; only Q3_K_L may fit, but CPU inference will be very slow (sub-1 t/s).

Expected Performance

All measurements assume a single stream (no batching) on an RTX 4090:

FP16: ~200–300 tokens per second (real-time — processes 10 seconds of audio in <1 second).
Q4_K_M: ~150–250 tokens per second. Quantization does not guarantee a speedup: depending on kernel support, throughput can land slightly below or slightly above FP16.
Latency: End-to-end latency for short utterances (<5 seconds) is under 150 ms in streaming mode (using the NeMo streaming inference pipeline).

Note: Tokens per second refers to output text tokens, not audio length. For ASR, 1 second of typical speech produces roughly 10–20 text tokens; 200 t/s means you can transcribe 10–20 seconds of speech per second of compute time.
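The conversion in the note is a single division. The sketch below restates it with the figures from the note; the function name is hypothetical and the token density of speech varies by language and tokenizer.

```python
def audio_seconds_per_compute_second(tokens_per_second, tokens_per_audio_second):
    """How many seconds of speech are transcribed per second of compute."""
    if tokens_per_audio_second <= 0:
        raise ValueError("tokens_per_audio_second must be positive")
    return tokens_per_second / tokens_per_audio_second


if __name__ == "__main__":
    # 200 output tokens/s, with speech producing 10 to 20 tokens per second.
    dense_speech = audio_seconds_per_compute_second(200, 20)
    sparse_speech = audio_seconds_per_compute_second(200, 10)
    print(dense_speech, sparse_speech)
```

Any result above 1 means the model keeps up with live audio on a single stream.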

Quick Start with Ollama

One possible local path is Ollama. Check whether Canary 1B v2 is listed in the Ollama model library (for example, try ollama pull nvidia-canary-1b-v2). The model runs on NeMo's inference engine, so you need the NeMo runtime installed unless the Ollama package bundles it. For direct PyTorch usage, see the Hugging Face model card.

How It Compares

vs Whisper-large-v3 (1.5B parameters, dense, MIT license)

Accuracy: Canary 1B v2 roughly matches Whisper-large-v3 on English ASR (WER ~4.5% vs ~4.4%, lower is better). On European languages, Canary is generally better for the supported set.
Speed: Canary is roughly 10× faster in inference, largely due to its efficient FastConformer encoder. Both models are encoder-decoder designs, but Whisper's is larger and heavier.
Multilingual coverage: Whisper covers ~100 languages; Canary only 25 European languages. If you need Asian or African languages, Whisper is the better choice.
Translation quality: Canary’s AST BLEU scores are competitive with Whisper-large-v3’s translation capabilities, especially for Western European pairs (de→en, fr→en).
Footprint: Canary uses about two-thirds of Whisper's parameters (0.978B vs 1.5B), so it fits on smaller GPUs with less aggressive quantization.
License: Both are permissive (MIT vs CC-BY-4.0). CC-BY-4.0 requires attribution, but otherwise similar freedom.

When to choose Canary: You work primarily with European languages, need maximum throughput, and want to run on consumer GPUs.

vs Parakeet-TDT-0.6B-v3 (0.6B parameters, same family)

Parakeet-TDT-0.6B-v3 is NVIDIA’s smaller sibling, covering the same 25 languages but with only 600M parameters. Canary 1B v2 has ~60% more parameters and achieves lower WER on most languages (by 1–3 points). Parakeet is the choice when VRAM is extremely tight (e.g., edge devices with 2–3 GB). For most desktop users, Canary is the better pick for accuracy-critical work.

Tradeoff summary: Canary 1B v2 delivers Whisper-large-v3-tier accuracy with much lower compute cost, but only for European languages. If your workflow is Eurocentric, it’s the most efficient local speech model available today.
