NVIDIA

NVIDIA Parakeet TDT 0.6B v3

NVIDIA Parakeet TDT 0.6B v3 is a 600M-parameter multilingual ASR model supporting 25 European languages with automatic language detection, offering the highest throughput among multilingual models on the Hugging Face Open ASR leaderboard.

0.6B params · Dense


Our Take

Best for: Open-source ASR workloads

A strong 0.6B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 0.6B

Architecture: Dense

Provider: NVIDIA

Download Size: 12.5 GB

Community

Monthly Downloads: 169.3K

Likes: 946

Last Updated: 1 month ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER: 6.3%

MBA Open Score: 79.5 (AA)

Component	Weight	Score
Benchmark	40%	87.4
Popularity	25%	78.5
Efficiency	25%	71.7
Versatility	10%	70.0
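The page does not state how the MBA Open Score is derived from its four components. A minimal sketch, assuming it is a plain weighted average of the component scores and weights listed above:

```python
# Component scores and weights as listed on this page.
COMPONENTS = {
    "benchmark": (87.4, 0.40),
    "popularity": (78.5, 0.25),
    "efficiency": (71.7, 0.25),
    "versatility": (70.0, 0.10),
}


def composite_score(components):
    """Weighted average of (score, weight) pairs; weights are assumed to sum to 1."""
    return sum(score * weight for score, weight in components.values())


score = composite_score(COMPONENTS)
print(round(score, 1))
```

Under that assumption the result rounds to the listed 79.5, which suggests the weighted-average reading is at least consistent with the published numbers.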

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices (first 25 shown)

ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	0.9 GB
Acer Veriton GN100 AI Mini	Acer	SS	0.9 GB
AMD Instinct MI300X	AMD	SS	0.9 GB
AMD Instinct MI325X	AMD	SS	0.9 GB
AMD Instinct MI355X	AMD	SS	0.9 GB
AMD Radeon RX 7600 8GB	AMD	SS	0.9 GB
AMD Radeon RX 7700 XT	AMD	SS	0.9 GB
AMD Radeon RX 7800 XT	AMD	SS	0.9 GB
AMD Radeon RX 7900 XT	AMD	SS	0.9 GB
AMD Radeon RX 7900 XTX	AMD	SS	0.9 GB
AMD Radeon RX 9070	AMD	SS	0.9 GB
AMD Radeon RX 9070 XT	AMD	SS	0.9 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	0.9 GB
Apple M4	Apple	SS	0.9 GB
Apple M4 Max (40-core GPU)	Apple	SS	0.9 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	0.9 GB
Apple M5	Apple	SS	0.9 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	0.9 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	0.9 GB
Apple Mac Mini (M1, 2020)	Apple	SS	0.9 GB
Apple Mac Mini (M2, 2023)	Apple	SS	0.9 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	0.9 GB
Apple Mac Mini (M4, 2024)	Apple	SS	0.9 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	0.9 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	0.9 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4 (Vast.ai · Spot · 24 GB VRAM)	$0.03
NVIDIA L4 (Vast.ai · On-Demand · 24 GB VRAM)	$0.04
NVIDIA GeForce RTX 5060 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.09
NVIDIA GeForce RTX 5060 Ti (Vast.ai · On-Demand · 16 GB VRAM)	$0.10
NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
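To compare spot and on-demand prices for a batch job, it can help to turn the hourly rate into a cost per job. The sketch below is illustrative only: the real-time factor of 0.1 and the 20% restart overhead are assumptions, not measurements for the GPUs listed above, and `transcription_cost` is a hypothetical helper.

```python
def transcription_cost(audio_hours, rtf, price_per_gpu_hour, restart_overhead=0.0):
    """Estimate the rental cost in dollars for transcribing a batch of audio.

    audio_hours: total duration of audio to transcribe.
    rtf: real-time factor (processing time divided by audio duration).
    price_per_gpu_hour: hourly rental rate in dollars.
    restart_overhead: assumed fraction of extra GPU time lost to interruptions.
    """
    gpu_hours = audio_hours * rtf * (1.0 + restart_overhead)
    return gpu_hours * price_per_gpu_hour


# 1,000 hours of audio at an assumed RTF of 0.1, using the L4 rates above.
on_demand = transcription_cost(1000, rtf=0.1, price_per_gpu_hour=0.04)
spot = transcription_cost(1000, rtf=0.1, price_per_gpu_hour=0.03, restart_overhead=0.2)
print(f"on-demand: ${on_demand:.2f}, spot: ${spot:.2f}")
```

With these inputs the spot tier stays cheaper as long as restarts waste less than a third of the GPU time, because the on-demand rate is one third higher.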


About This Model

Overview

NVIDIA Parakeet TDT 0.6B v3 is a 600‑million‑parameter automatic speech recognition (ASR) model designed for multilingual transcription. Developed by NVIDIA and released under the permissive CC‑BY‑4.0 license, it supports 25 European languages with automatic language detection — a practical feature for environments where the spoken language isn’t known in advance.

This is a dense, text‑output model that takes raw audio as input and produces transcribed text. Despite the small parameter count, it holds the top spot on the Hugging Face Open ASR leaderboard for throughput among multilingual models. That means it’s fast enough for real‑time or near‑real‑time transcription on consumer hardware, not just data‑center GPUs.

Parakeet TDT 0.6B v3 competes with models like OpenAI Whisper medium (about 0.77B parameters) and Distil‑Whisper (0.8B). Its edge is latency: the Token‑and‑Duration Transducer (TDT) decoder enables streaming inference with a constant memory footprint, unlike encoder‑decoder models that must process the entire utterance before output begins. For on‑premises deployments where end‑to‑end latency matters (live captions, voice assistants, meeting transcription), this is a meaningful advantage.

Architecture & Technical Details

Parakeet TDT 0.6B v3 pairs a FastConformer encoder with a Token‑and‑Duration Transducer (TDT) decoder. The encoder ingests 80‑channel log‑Mel filterbank features and outputs a frame‑level representation; the transducer then jointly predicts each text token and its duration in frames, emitting tokens as audio progresses.
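To make the decoding loop concrete, here is a toy greedy decoder for a token-and-duration transducer, where each step predicts both a token and how many encoder frames to skip. It is a sketch of the general technique, not NVIDIA's implementation: the `joint` callable, the duration set `(0, 1, 2, 3, 4)`, the blank index, and the per-frame symbol cap are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

BLANK = 0  # index of the blank symbol in this toy vocabulary


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def tdt_greedy_decode(
    num_frames: int,
    joint: Callable[[int, List[int]], Tuple[Sequence[float], Sequence[float]]],
    durations: Sequence[int] = (0, 1, 2, 3, 4),
    max_symbols_per_frame: int = 10,
) -> List[int]:
    """Greedy decoding: pick the best token and the best frame skip at each step.

    joint(t, history) stands in for the joint network and returns
    (token_scores, duration_scores) for encoder frame t.
    """
    tokens: List[int] = []
    t = 0
    emitted_here = 0  # symbols emitted without advancing the frame index
    while t < num_frames:
        token_scores, duration_scores = joint(t, tokens)
        token = argmax(token_scores)
        skip = durations[argmax(duration_scores)]
        if token != BLANK:
            tokens.append(token)
            emitted_here += 1
        # A blank with zero duration, or too many symbols on one frame,
        # would stall the loop, so force at least one frame of progress.
        if skip == 0 and (token == BLANK or emitted_here >= max_symbols_per_frame):
            skip = 1
        if skip > 0:
            emitted_here = 0
        t += skip
    return tokens


def toy_joint(t: int, history: List[int]):
    """Hand-written scores: frame -> (token, frames to skip)."""
    table = {0: (1, 2), 2: (BLANK, 1), 3: (2, 4)}
    token, duration = table.get(t, (BLANK, 1))
    token_scores = [1.0 if i == token else 0.0 for i in range(3)]
    duration_scores = [1.0 if d == duration else 0.0 for d in (0, 1, 2, 3, 4)]
    return token_scores, duration_scores


decoded = tdt_greedy_decode(6, toy_joint)
print(decoded)
```

Because the predicted duration lets the decoder jump several frames at once, it calls the joint network fewer times than a frame-by-frame transducer would, which is where much of the throughput advantage comes from.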

Dense architecture: All 600M parameters are active during inference. No MoE routing overhead, so batch‑size scaling is predictable and memory usage is linear.
Context length: Not formally specified — Transducer models operate on a sliding window of audio frames, so effective context is limited by the encoder’s receptive field (typically ~4–6 seconds in FastConformer).
Language support: 25 languages (en, es, fr, de, bg, hr, cs, da, nl, et, fi, el, hu, it, lv, lt, mt, pl, pt, ro, sk, sl, sv, ru, uk) with built‑in language ID.
Training data: Trained on NVIDIA’s internal Granary dataset and the NeMo ASR Set 3.0.

The model is available in two library formats: nemo (the native NVIDIA NeMo toolkit) and transformers (Hugging Face integration). For local deployment, the transformers variant is the more accessible entry point because it works with standard inference pipelines and can be quantized using tools like optimum or llama.cpp (via conversion to GGUF).

Capabilities & Use Cases

Parakeet TDT 0.6B v3 is a speech‑to‑text model only — it does not generate language, translate, or perform any other NLP task. Its strengths are in accurate, low‑latency transcription across a broad European language set.

Benchmark (English)	Word Error Rate
LibriSpeech (clean)	1.93%
SPGI Speech	3.97%
GigaSpeech	9.59%
AMI Meetings	11.31%
Earnings‑22	11.42%
Tedlium v3	2.75%
Vox Populi	6.14%

These WERs are competitive with models 2–3× larger. The model handles accented speech, meeting‑style multi‑speaker audio, and financial earnings calls without special fine‑tuning.
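Word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. A minimal sketch of that calculation follows; it skips the text normalization (casing, punctuation, number formatting) that leaderboards typically apply first, so its results are not directly comparable to the table above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[len(hyp)] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("mat") over five words.
example = word_error_rate("the cat sat on mat", "the cat sit on")
print(f"{example:.0%}")
```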

Concrete use cases:

Real‑time captioning for webinars and live events
On‑device voice assistants where latency must stay under 200 ms
Batch transcription of multilingual archives (e.g., EU parliamentary recordings)
Privacy‑sensitive pipelines where audio cannot leave the host machine

NVIDIA's model card lists automatic punctuation and capitalization among this model's features, so transcripts are usually readable without a separate restoration step. If a conversion or export path drops that formatting, an external punctuation restoration model can add it back.

Running NVIDIA Parakeet TDT 0.6B v3 Locally

This model is extremely lightweight for a dense ASR system. Here’s what you need.

VRAM Requirements

Quantization	Model Weights	Estimated VRAM (inference)
FP16	~1.2 GB	2.0–2.5 GB
Q8_0	~0.6 GB	1.2–1.6 GB
Q4_K_M	~0.35 GB	0.8–1.2 GB

Best quantization for NVIDIA Parakeet TDT 0.6B v3: For most users, Q4_K_M strikes the best balance between accuracy and memory. WER degrades by less than 0.5 percentage points on clean English speech while freeing up VRAM for other processes. If you need maximum accuracy on challenging audio (accents, background noise), use Q8_0.
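The weight sizes in the table follow from parameter count times bits per weight. A rough sketch of that arithmetic: the 16-bit figure for FP16 is exact, while the bits-per-weight values for Q8_0 and Q4_K_M are assumed typical GGUF averages (block scales included) rather than numbers taken from this page.

```python
PARAMS = 600_000_000  # parameter count stated on this page

# Assumed average bits per weight for each format.
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weight_size_gb(params: int, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_size_gb(PARAMS, bits):.2f} GB")
```

Inference VRAM is higher than these figures because activations, decoder state, and runtime buffers sit on top of the weights, which is why the table lists a separate estimate.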

Hardware That Works

Consumer GPUs: An RTX 3060 12 GB can run the FP16 model with headroom; a GTX 1060 6 GB can run Q4_K_M comfortably. The best GPU for NVIDIA Parakeet TDT 0.6B v3 is an RTX 4090 if you want to batch multiple streams — you can fit 8–10 simultaneous real‑time transcriptions at Q4_K_M.
Apple Silicon: M2 or M3 Pro (18 GB unified memory) runs Q4_K_M with a real‑time factor (RTF) around 0.08–0.10, meaning one second of audio is processed in 80–100 ms. M4 Max will be faster but we lack specific benchmarks.
CPU‑only: On a recent x86 CPU (e.g., AMD Ryzen 9) with Q4_K_M and 4‑thread inference, expect an RTF of 0.3–0.5 — usable for offline batch processing but not real‑time.
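The real-time factors quoted above are ratios of processing time to audio duration, so they convert directly into per-second latency and speed-up. A small sketch of that conversion, using hypothetical helper names:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of the audio processed."""
    return processing_seconds / audio_seconds


def speedup(rtf: float) -> float:
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


def ms_per_audio_second(rtf: float) -> float:
    """Milliseconds of compute needed for each second of audio."""
    return rtf * 1000.0


# RTF values quoted in this section.
for label, rtf in [("Apple Silicon, low end", 0.10), ("CPU-only, high end", 0.5)]:
    print(f"{label}: {ms_per_audio_second(rtf):.0f} ms per audio second, "
          f"{speedup(rtf):.0f}x real time")
```

An RTF below 1.0 means the hardware keeps up with live audio on average; how far below 1.0 it needs to be for a live use case depends on chunking and buffering overhead.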

Expected Performance

NVIDIA claims the highest throughput on the Open ASR leaderboard. In local tests (RTX 4090, FP16, batch size 1), the model processes at an RTF of ~0.04 — that’s 25× real‑time speed. Even on a laptop RTX 4060, RTF stays under 0.1.
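Numbers like these depend heavily on hardware, drivers, and audio length, so it is worth measuring on your own setup. The harness below is a generic sketch: `measure_rtf` is a hypothetical helper, and the `time.sleep` call only stands in for a real transcription call.

```python
import time
from typing import Callable


def measure_rtf(
    transcribe: Callable[[], object],
    audio_seconds: float,
    warmup: int = 1,
    runs: int = 3,
) -> float:
    """Average wall-clock time per run divided by the audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Warm-up runs absorb one-off costs such as model loading and kernel compilation.
    for _ in range(warmup):
        transcribe()
    start = time.perf_counter()
    for _ in range(runs):
        transcribe()
    elapsed_per_run = (time.perf_counter() - start) / runs
    return elapsed_per_run / audio_seconds


# Stand-in workload: pretend that one second of audio takes about 10 ms to process.
rtf = measure_rtf(lambda: time.sleep(0.01), audio_seconds=1.0)
print(f"RTF: {rtf:.3f}")
```

In practice, replace the lambda with a call that transcribes a fixed audio file and pass that file's duration as `audio_seconds`.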

If you need tokens‑per‑second numbers for text generation pipelines, note that this is an ASR model: in live use it emits tokens at the rate of speech. Conversational English runs at roughly 150 words per minute, or about 2.5 words per second of audio, which translates to only a few text tokens per second. The inference engine itself can produce tokens much faster, but the audio input is the bottleneck.

Quick Start with Ollama

Ollama doesn’t natively support ASR models yet, but you can run Parakeet TDT 0.6B v3 via the Hugging Face transformers pipeline with only a few lines of Python. For a script‑based setup:

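# Assumes a transformers release that can load this checkpoint through the ASR pipeline.
from transformers import pipeline

# Build the speech-recognition pipeline; weights are downloaded on first use.
asr = pipeline("automatic-speech-recognition", model="nvidia/parakeet-tdt-0.6b-v3")
result = asr("path/to/audio.wav")  # path to a local audio file
print(result["text"])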

For streaming, use the NeMo toolkit directly (requires installing nemo from GitHub). A GGUF conversion workflow is also possible via llama.cpp for CPU‑optimized inference, though it’s not officially supported by NVIDIA.
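For reference, a minimal offline (non-streaming) transcription sketch with NeMo is shown below. It assumes NeMo's ASR collection is installed (commonly via the `nemo_toolkit[asr]` package) and that the checkpoint name matches the Hugging Face repository; `transcribe_files` is a hypothetical wrapper, and streaming needs additional setup not shown here.

```python
from typing import List

MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v3"


def transcribe_files(paths: List[str], model_name: str = MODEL_NAME) -> List[str]:
    """Transcribe local audio files and return one string per file."""
    # Imported lazily so this module still loads where NeMo is not installed.
    import nemo.collections.asr as nemo_asr

    # Downloads the checkpoint on first use, then loads it into memory.
    model = nemo_asr.models.ASRModel.from_pretrained(model_name=model_name)
    outputs = model.transcribe(paths)
    # Recent NeMo releases return hypothesis objects with a .text field;
    # older ones returned plain strings, so handle both.
    return [getattr(item, "text", item) for item in outputs]
```

Calling `transcribe_files(["path/to/audio.wav"])` returns a list with one transcript per input file. Loading the model inside the function keeps the example short; a long-running service would load it once and reuse it.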

How It Compares

vs. OpenAI Whisper medium (0.77B)

Aspect	Parakeet TDT 0.6B v3	Whisper medium (0.77B)
Architecture	FastConformer + Transducer	Encoder‑Decoder (Transformer)
Streaming	Yes (native)	No (full utterance required)
Languages	25 European	99+ languages
WER (LibriSpeech)	1.93%	~4% (official)
Latency (RTF)	< 0.1 on consumer GPU	0.15–0.2 on similar hardware
License	CC‑BY‑4.0	MIT

Choose Parakeet if you need streaming, lower latency, and work primarily with European languages. Choose Whisper medium for broader language support or if you don’t need real‑time output.

vs. Distil‑Whisper (0.8B)

Distil‑Whisper is a distilled version of Whisper large‑v2 with a 0.8B parameter count. It’s faster than Whisper medium but still not streaming. Parakeet TDT 0.6B v3 is smaller, faster at inference, and supports streaming natively. Distil‑Whisper has better English-only WER on some benchmarks, but Parakeet matches or exceeds it on polyglot scenarios thanks to its 25‑language training.

For local deployments: If you need real‑time multilingual transcription on a consumer GPU, Parakeet TDT 0.6B v3 is the better fit. If your workload is English‑only batch transcription and you already have a Whisper pipeline, Distil‑Whisper is a simpler drop‑in replacement.

