NVIDIA Canary 1B Flash

NVIDIA Canary 1B Flash is a fast 883M-parameter multilingual encoder-decoder model for ASR and speech translation. It supports four languages and reaches inference speeds above 1000 RTFx.

0.883B params · Dense

Our Take

Best for: Open-source ASR workloads

A solid 0.883B-parameter dense audio model from NVIDIA. Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 0.883B

Architecture: Dense

Provider: NVIDIA

Download Size: 6.8 GB

Community

Monthly Downloads: 6.7K

Likes: 273

Last Updated: 6 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER: 6.3%

MBA Open Score: 69.7 BB

Component	Weight	Score
Benchmark	40%	87.3
Popularity	25%	50.4
Efficiency	25%	60.9
Versatility	10%	70.0
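
The four weighted components listed above appear to combine into the headline score. The sketch below assumes a plain weighted sum, which the page does not state explicitly; under that assumption it reproduces the listed 69.7. The names `COMPONENTS` and `composite_score` are illustrative only.

```python
# Component scores and weights as listed above (weights sum to 1.0).
COMPONENTS = {
    "benchmark": (87.3, 0.40),
    "popularity": (50.4, 0.25),
    "efficiency": (60.9, 0.25),
    "versatility": (70.0, 0.10),
}


def composite_score(components):
    """Weighted sum of (score, weight) pairs; assumes a linear blend."""
    return sum(score * weight for score, weight in components.values())


score = composite_score(COMPONENTS)
print(f"{score:.1f}")  # prints 69.7
```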

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.0 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.0 GB
AMD Instinct MI300X	AMD	SS	1.0 GB
AMD Instinct MI325X	AMD	SS	1.0 GB
AMD Instinct MI355X	AMD	SS	1.0 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.0 GB
AMD Radeon RX 7700 XT	AMD	SS	1.0 GB
AMD Radeon RX 7800 XT	AMD	SS	1.0 GB
AMD Radeon RX 7900 XT	AMD	SS	1.0 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.0 GB
AMD Radeon RX 9070	AMD	SS	1.0 GB
AMD Radeon RX 9070 XT	AMD	SS	1.0 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.0 GB
Apple M4	Apple	SS	1.0 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.0 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.0 GB
Apple M5	Apple	SS	1.0 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.0 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.0 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.0 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.0 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.0 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.0 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.0 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.0 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 5070 Ti · Vast.ai · Spot · 16 GB VRAM	$0.11
NVIDIA GeForce RTX 3070 · RunPod · Community · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 3070 · RunPod · Spot · 8 GB VRAM	$0.13
NVIDIA GeForce RTX 5090 · Vast.ai · Spot · 32 GB VRAM	$0.13
NVIDIA GeForce RTX 4090 · Vast.ai · Spot · 24 GB VRAM	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

About This Model

Overview

NVIDIA Canary 1B Flash is a multilingual automatic speech recognition (ASR) and speech translation model built for fast local inference. At 0.883B parameters (883 million), it sits in the small‑to‑medium tier of speech models, but its inference throughput, above 1000 RTFx (seconds of audio processed per second of compute), puts it ahead of many larger alternatives. Developed by NVIDIA, it is distributed for use with the NeMo framework and released under the CC‑BY‑4.0 license.

This model matters for practitioners who need on‑device speech processing without cloud dependencies. It supports four languages (English, German, French, Spanish) and handles both transcription and cross‑lingual translation in a single encoder‑decoder architecture. Unlike larger dense models that require high‑end hardware, Canary 1B Flash runs comfortably on consumer GPUs and even some CPU setups with appropriate quantization.

The model fills a specific niche: fast, accurate, multilingual ASR that doesn’t demand 8+ GB of VRAM. It competes with models such as OpenAI Whisper (medium, 769M) and NVIDIA’s own Parakeet‑0.6B. Against those, Canary 1B Flash trades a slightly larger parameter count for significantly better efficiency, especially in streaming or real‑time scenarios where RTFx matters more than small WER gains.

Architecture & Technical Details

Canary 1B Flash uses a dense encoder‑decoder architecture built on FastConformer. The encoder has 32 layers, and the decoder is a 4‑layer Transformer. FastConformer is a variant of Conformer that reduces computational overhead while preserving the ability to model long audio sequences. The model employs a concatenated tokenizer for multilingual processing, combining subword units across English, German, French, and Spanish.

Despite being a dense model (no mixture of experts), the parameter count of 0.883B makes it memory‑efficient. In FP16 precision, the weights occupy ~1.8 GB. With activation memory and framework overhead, a typical inference session requires about 2–3 GB of VRAM. Quantization to 8‑bit (FP8 or INT8) cuts that to under 1 GB, enabling deployment on integrated GPUs and some NPUs.
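
The weight-size figures above follow from multiplying the parameter count by the bytes stored per parameter. The sketch below shows that arithmetic for the weights only; activation memory and framework overhead are not modelled, and the helper name `weight_size_gb` is illustrative.

```python
PARAMS = 0.883e9  # parameter count quoted above

# Bytes stored per parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_size_gb(params, precision):
    """Size of the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(precision, round(weight_size_gb(PARAMS, precision), 2), "GB")
```

FP16 comes out at roughly 1.77 GB and 8-bit at roughly 0.88 GB, in line with the "~1.8 GB" and "under 1 GB" figures in the text.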

The model’s context length is not explicitly specified, but the NeMo framework’s default chunking handles long‑form audio automatically. Input can be raw audio (WAV, FLAC) sampled at 16 kHz. The output is text with optional punctuation, capitalisation, and word‑level timestamps.
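
Because the model expects 16 kHz input, it can help to check a file's sample rate before passing it on. The sketch below uses only Python's standard `wave` module and handles PCM WAV files only; FLAC would need a third-party reader. The function names are illustrative and not part of NeMo.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate stated above


def wav_info(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def needs_resampling(path):
    """True when the file's sample rate differs from 16 kHz."""
    rate, _, _ = wav_info(path)
    return rate != EXPECTED_RATE_HZ
```

Files that fail the check can be resampled with any audio tool before transcription.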

Inference speed is benchmarked at >1000 RTFx on an NVIDIA A100 (likely with TensorRT optimisations). On a consumer RTX 4090, real‑world RTFx typically exceeds 500 even without extreme batching. This makes it suitable for real‑time transcription pipelines where latency is critical.

Capabilities & Use Cases

Canary 1B Flash’s primary capabilities are automatic speech recognition (ASR) and automatic speech translation (AST). It transcribes English, German, French, and Spanish, and translates German, French, and Spanish to English as well as English to each of those three languages. The model was trained on 85,000 hours of multilingual speech from sources like LibriSpeech, Common Voice, VoxPopuli, and Fisher.

Concrete use cases:

Live transcription of meetings or lectures – The model’s low latency makes it viable for real‑time captioning when run on a laptop with a discrete GPU.
Multilingual voice assistants – Deploy a single model for both ASR and translation in client‑side applications, avoiding round trips to the cloud.
Offline speech‑to‑text for field data collection – With VRAM under 2 GB, it fits on edge devices like the NVIDIA Jetson Orin or a MacBook with M‑series chip (via Core ML or Metal).
Audio indexing and search – Generate word‑level timestamps to enable search within recorded audio.
Language‑agnostic audio processing – Accept input in any of the four supported languages and output English text using the translation mode.

Benchmarks reported on the Hugging Face card show a WER of 2.87% on LibriSpeech test‑other and 1.95% on SPGI Speech, and a BLEU score of 32.27 for En→De translation on FLEURS. These are competitive for a model of this size.
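
For readers who want to measure WER on their own recordings, the metric is the word-level edit distance between the reference and the model output, divided by the number of reference words. The sketch below is a minimal implementation of that definition. It does not apply the text normalisation (casing, punctuation) that published benchmarks typically use, so its numbers are not directly comparable to the figures above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)
```

For example, dropping one word from a six-word reference gives a WER of 1/6, or about 16.7%.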

Running NVIDIA Canary 1B Flash Locally

Hardware Requirements

Minimum (quantized): 1 GB VRAM. With Q4_K_M quantization, the model fits on an Intel Arc A380, RTX 3060 (8 GB), or even an Apple M1 with 8 GB unified memory (using the NeMo Core ML export).
Recommended (full FP16): 4 GB VRAM. Run it on an RTX 2060, RTX 3050, or GTX 1080 Ti. Expect stable performance with batch size 1.
Consumer GPU performance: On an RTX 4090, you can achieve >1000 RTFx with TensorRT optimizations. On an M4 Max (Apple Silicon), expect RTFx around 200–300 with the NeMo ONNX export.
CPU inference: Possible with INT8 quantization on a modern CPU with AVX‑512, but RTFx drops below 100, which is adequate for offline batch processing.

Quantization

For most users, Q4_K_M is the sweet spot. It reduces memory footprint by ~75% while keeping WER degradation under 0.5%. If you need maximum speed on constrained hardware, try Q4_0 or Q2_K – the model is robust to aggressive quantization because of its dense architecture. FP8 needs hardware support that NVIDIA introduced with the Ada Lovelace and Hopper generations (SM 8.9 and 9.0); on older GPUs, use INT8 for wider compatibility.

Getting Started

The simplest path is to download the model from Hugging Face and use NVIDIA NeMo’s inference scripts. There is no Ollama integration (this is a speech model, not an LLM). Instead, use:

import nemo.collections.asr as nemo_asr

# Download the checkpoint from Hugging Face and build the model
model = nemo_asr.models.EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-flash")

NeMo handles audio chunking, batching, and timestamp extraction. For production pipelines, export to TensorRT via the NeMo toolkit for optimal speed.
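
To choose between transcription and translation per file, one option is a JSON-lines manifest with one request per line. The sketch below builds such a file using only the standard library. The field names (`audio_filepath`, `source_lang`, `target_lang`, `pnc`, `timestamp`) are assumptions based on the NeMo manifest convention and should be checked against the model card before use; `manifest_line` and `write_manifest` are hypothetical helpers, not NeMo functions.

```python
import json

SUPPORTED_LANGS = {"en", "de", "fr", "es"}  # the four languages listed above


def manifest_line(audio_path, source_lang="en", target_lang="en"):
    """One JSON line describing a transcription or translation request.

    Field names are assumptions; verify them against the model card.
    """
    if source_lang not in SUPPORTED_LANGS or target_lang not in SUPPORTED_LANGS:
        raise ValueError("unsupported language code")
    return json.dumps({
        "audio_filepath": audio_path,
        "source_lang": source_lang,
        "target_lang": target_lang,  # equal to source_lang for plain ASR
        "pnc": "yes",                # punctuation and capitalisation
        "timestamp": "no",
    })


def write_manifest(path, audio_paths, source_lang="en", target_lang="en"):
    """Write one manifest line per audio file."""
    with open(path, "w", encoding="utf-8") as f:
        for audio_path in audio_paths:
            f.write(manifest_line(audio_path, source_lang, target_lang) + "\n")
```

Setting `source_lang="de"` and `target_lang="en"` would request German-to-English translation under these assumptions.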

Performance Benchmarks (real‑world)

GPU	Precision	RTFx
RTX 4090	FP16 – TensorRT	1200+
RTX 3060 (12 GB)	INT8 – ONNX	300
M4 Max (24‑core)	FP16	250
Jetson Orin NX 16 GB	FP16	150

RTFx above 1 means you can process more than one second of audio per second of compute. At 1000 RTFx, a one‑hour recording transcribes in ~3.6 seconds.
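
The conversion from RTFx to wall-clock time is a single division. The sketch below applies it to a one-hour recording at the RTFx values from the table above, taking "1200+" as its lower bound of 1200; the function name is illustrative.

```python
def processing_seconds(audio_seconds, rtfx):
    """Compute time needed to process audio at a given RTFx."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


ONE_HOUR = 3600  # seconds of audio
for rtfx in (1200, 1000, 300, 250, 150):
    print(f"{rtfx} RTFx: {processing_seconds(ONE_HOUR, rtfx):.1f} s")
```

At 150 RTFx the same hour of audio takes 24 seconds, still far faster than real time.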

How It Compares

vs. OpenAI Whisper Medium (769M)

Whisper Medium supports 99 languages, but its encoder‑decoder is slower and requires around 3 GB at FP16. Canary 1B Flash achieves 2×–3× higher RTFx on the same GPU for the four supported languages. If you need broad language coverage, Whisper Medium is the better choice. If speed and low VRAM matter, Canary wins. On licensing, both are permissive: Whisper uses the MIT license, and Canary’s CC‑BY‑4.0 allows commercial use with attribution.

vs. NVIDIA Parakeet‑0.6B (English‑only)

Parakeet is slightly smaller (0.6B) and optimised for streaming with minimum 160 ms latency. Canary 1B Flash is larger but offers full multilingual capability and speech translation. For English‑only scenarios where latency is the top priority, Parakeet may edge ahead. For a single model that does English, German, French, and Spanish, Canary 1B Flash is more versatile.

vs. Wav2Vec2‑Large (300M)

Wav2Vec2 is smaller but only does ASR, not translation. Canary 1B Flash is faster and more feature‑rich. If you need an ultra‑lightweight model for English ASR on a Raspberry Pi, Wav2Vec2 might still be appropriate, but for any modern GPU, Canary 1B Flash provides better accuracy and functionality.

