NVIDIA Parakeet TDT 0.6B v2 is a 600M-parameter English ASR model that topped the Hugging Face Open ASR Leaderboard in May 2025 with a 6.05% WER, and can transcribe roughly an hour of audio in about one second.
Parakeet TDT 0.6B v2 (parakeet-tdt-0.6b-v2) is a 600-million-parameter automatic speech recognition model designed for high-quality English transcription. It outputs punctuation and capitalization and predicts accurate word-level timestamps. At release, it topped the Hugging Face Open ASR Leaderboard with a 6.05% WER and achieves an RTFx of ~3380 on the leaderboard benchmark (batch size 128), transcribing ~60 minutes of audio in ~1 second.
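To make the throughput claim concrete, a minimal sketch of what an RTFx (real-time factor, audio duration divided by processing time) of ~3380 implies for wall-clock transcription time:

```python
# RTFx = audio_duration / processing_time, so processing_time = audio_duration / RTFx.
def transcription_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate wall-clock time to transcribe `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rtfx

# One hour of audio at RTFx ~3380 -> about 1.07 seconds of processing.
print(round(transcription_seconds(3600, 3380), 2))
```

Note that RTFx is measured at batch size 128; single-file, single-stream throughput will be lower.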
Architecture: an XL variant of FastConformer with a Token-and-Duration Transducer (TDT) decoder, trained with full attention, which enables transcription of audio segments up to 24 minutes long in a single pass.
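For audio longer than the 24-minute single-pass limit, the input must be split before transcription. A minimal sketch of such chunking (the helper below is hypothetical, not part of the model's API):

```python
# Split an audio timeline into spans no longer than the 24-minute single-pass limit.
def chunk_spans(total_seconds: float, max_seconds: float = 24 * 60):
    """Return (start, end) spans covering the audio, each at most max_seconds long."""
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans

# A 60-minute file -> three spans: 0-1440 s, 1440-2880 s, 2880-3600 s.
print(chunk_spans(3600.0))
```

In practice, splitting at silence boundaries (e.g. via voice activity detection) rather than at fixed offsets avoids cutting words in half.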
Training: Initialized from a FastConformer SSL checkpoint pretrained on the LibriLight dataset with the wav2vec method, then trained for 150,000 steps on 64 A100 GPUs. Stage 2 fine-tuning was performed for 2,500 steps on 4 A100 GPUs using ~500 hours of high-quality human-transcribed data from NeMo ASR Set 3.0. Total training corpus: ~120,000 hours of English from the Granary dataset (~10K hours human-transcribed + ~110K hours pseudo-labeled).
Use cases: High-throughput English transcription, subtitle generation, voice assistants, call-center analytics, song-to-lyrics transcription, and telephony transcription. Runs on NVIDIA GPUs (A100/H100/T4/V100) and is loadable with as little as 2 GB of RAM.
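A back-of-the-envelope sketch of why the ~2 GB figure is plausible, assuming fp16 weights and ignoring activation and framework overhead (the helper name is illustrative, not from the model's documentation):

```python
# Approximate memory for model weights alone: params x bytes-per-param.
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Estimate weight memory in GiB (default: 2 bytes/param for fp16)."""
    return n_params * bytes_per_param / 1024**3

# 600M parameters at fp16 -> roughly 1.12 GiB of weights, leaving headroom within 2 GB.
print(round(weight_memory_gb(600e6), 2))

# Loading with NVIDIA NeMo (not executed here; requires `pip install nemo_toolkit[asr]`):
# import nemo.collections.asr as nemo_asr
# model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
# print(model.transcribe(["sample.wav"])[0].text)
```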