NVIDIA Parakeet TDT 0.6B v2 is a 600M-parameter English ASR model that topped the Hugging Face Open ASR Leaderboard in May 2025 with a 6.05% WER. In batched inference it can transcribe roughly an hour of audio in about one second.
Parakeet-tdt-0.6b-v2 is a 600-million-parameter automatic speech recognition (ASR) model designed for high-quality English transcription. It supports punctuation, capitalization, and accurate word-level timestamp prediction. At release, it topped the Hugging Face Open ASR Leaderboard with a 6.05% word error rate (WER). It achieves an RTFx (inverse real-time factor) of ~3380 on HF-Open-ASR at batch size 128, which corresponds to transcribing ~60 minutes of audio in ~1 second.
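The "60 minutes in about 1 second" figure follows directly from the RTFx value, since RTFx is audio duration divided by processing time. The short sketch below reproduces that arithmetic; the function name `processing_seconds` is illustrative only, and the result is an idealized estimate that assumes the quoted benchmark conditions (batch size 128) hold.

```python
def processing_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate wall-clock seconds needed to transcribe audio at a given RTFx.

    RTFx = audio duration / processing time, so
    processing time = audio duration / RTFx.
    """
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


RTFX = 3380.0          # approximate figure quoted for HF-Open-ASR, batch size 128
ONE_HOUR = 60 * 60     # seconds of audio

print(f"{processing_seconds(ONE_HOUR, RTFX):.3f} s")  # about 1.065 s
```

Real-world throughput depends on hardware, batch size, and audio length, so this is best read as an upper bound on speed rather than a guarantee.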
Architecture: XL variant of the FastConformer architecture with a Token-and-Duration Transducer (TDT) decoder, trained with full attention, enabling transcription of audio segments up to 24 minutes in a single pass.
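For recordings longer than the 24-minute single-pass limit, one option is to split the audio into consecutive windows before transcription. The sketch below only plans the window boundaries. The helper `plan_segments` is hypothetical, and cutting at fixed offsets is an assumption made for simplicity: a production pipeline would more likely cut at silences to avoid splitting words.

```python
from typing import List, Tuple

MAX_SEGMENT_SECONDS = 24 * 60  # single-pass limit quoted for this model


def plan_segments(
    total_seconds: float, max_seconds: float = MAX_SEGMENT_SECONDS
) -> List[Tuple[float, float]]:
    """Split [0, total_seconds) into consecutive (start, end) windows.

    Every window is at most max_seconds long; the last one may be shorter.
    """
    if total_seconds < 0 or max_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_seconds > 0")
    total = float(total_seconds)
    segments: List[Tuple[float, float]] = []
    start = 0.0
    while start < total:
        end = min(start + max_seconds, total)
        segments.append((start, end))
        start = end
    return segments


# A 60-minute recording needs three passes: 24 + 24 + 12 minutes.
print(plan_segments(60 * 60))
# [(0.0, 1440.0), (1440.0, 2880.0), (2880.0, 3600.0)]
```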
Training: Initialized from a FastConformer SSL checkpoint pretrained on LibriLight with a wav2vec method, then trained for 150,000 steps on 64 A100 GPUs. A second fine-tuning stage ran for 2,500 steps on 4 A100 GPUs using ~500 hours of high-quality human-transcribed data from NeMo ASR Set 3.0. Total training corpus: ~120,000 hours of English from the Granary dataset (~10,000 hours human-transcribed + ~110,000 hours pseudo-labeled).
Use cases: High-throughput English transcription, subtitle generation, voice assistants, call-center analytics, song-to-lyrics transcription, and telephony transcription. Runs on NVIDIA GPUs (A100/H100/T4/V100); loading the model requires at least 2 GB of RAM.
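As an illustration of the subtitle-generation use case, word-level timestamps can be grouped into SubRip (SRT) cues. The sketch below is toolkit-agnostic: it assumes each word arrives as a dict with `word`, `start`, and `end` keys (times in seconds), which is a hypothetical input shape that may need adapting to whatever the inference library actually returns. The helper names and the 7-words-per-cue default are likewise illustrative choices.

```python
from typing import Dict, List


def srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group word-level timestamps into numbered SRT cues.

    Each cue spans from the start of its first word to the end of its last.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        text = " ".join(w["word"] for w in group)
        start = srt_time(group[0]["start"])
        end = srt_time(group[-1]["end"])
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Made-up sample data, not real model output.
sample = [
    {"word": "Hello,", "start": 0.0, "end": 0.4},
    {"word": "world.", "start": 0.48, "end": 0.96},
]
print(words_to_srt(sample))
```

Grouping by a fixed word count keeps the example short; splitting cues at punctuation or at long pauses would usually read better on screen.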