NVIDIA Parakeet TDT 0.6B v2 is a 600M-parameter English ASR model that topped the Hugging Face Open ASR Leaderboard in May 2025 with a 6.05% WER, and can transcribe roughly an hour of audio in about one second.
Parakeet TDT 0.6B v2 (parakeet-tdt-0.6b-v2) is a 600-million-parameter automatic speech recognition model designed for high-quality English transcription. It outputs punctuation and capitalization and predicts accurate word-level timestamps. At release, it topped the Hugging Face Open ASR Leaderboard with a 6.05% WER and achieves an RTFx of ~3380 on the leaderboard benchmark (batch size 128), transcribing ~60 minutes of audio in ~1 second.
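To make the throughput claim concrete, a minimal sketch of what an RTFx (real-time factor, audio duration divided by processing time) of ~3380 implies for wall-clock transcription time:

```python
# RTFx = audio_duration / processing_time, so processing_time = audio_duration / RTFx.
def transcription_seconds(audio_seconds: float, rtfx: float) -> float:
    """Estimate wall-clock time to transcribe `audio_seconds` of audio at a given RTFx."""
    return audio_seconds / rtfx

# One hour of audio at RTFx ~3380 -> about 1.07 seconds of processing.
print(round(transcription_seconds(3600, 3380), 2))
```

Note that RTFx is measured at batch size 128; single-file, single-stream throughput will be lower.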
Architecture: an XL variant of FastConformer with a Token-and-Duration Transducer (TDT) decoder, trained with full attention, which enables transcription of audio segments up to 24 minutes long in a single pass.
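For audio longer than the 24-minute single-pass limit, the input must be split before transcription. A minimal sketch of such chunking (the helper below is hypothetical, not part of the model's API):

```python
# Split an audio timeline into spans no longer than the 24-minute single-pass limit.
def chunk_spans(total_seconds: float, max_seconds: float = 24 * 60):
    """Return (start, end) spans covering the audio, each at most max_seconds long."""
    spans = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_seconds, total_seconds)
        spans.append((start, end))
        start = end
    return spans

# A 60-minute file -> three spans: 0-1440 s, 1440-2880 s, 2880-3600 s.
print(chunk_spans(3600.0))
```

In practice, splitting at silence boundaries (e.g. via voice activity detection) rather than at fixed offsets avoids cutting words in half.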
Training: Initialized from a FastConformer SSL checkpoint pretrained on the LibriLight dataset with the wav2vec method, then trained for 150,000 steps on 64 A100 GPUs. Stage 2 fine-tuning was performed for 2,500 steps on 4 A100 GPUs using ~500 hours of high-quality human-transcribed data from NeMo ASR Set 3.0. Total training corpus: ~120,000 hours of English from the Granary dataset (~10K hours human-transcribed + ~110K hours pseudo-labeled).
Use cases: High-throughput English transcription, subtitle generation, voice assistants, call-center analytics, song-to-lyrics transcription, and telephony transcription. Runs on NVIDIA GPUs (A100/H100/T4/V100) and is loadable with as little as 2 GB of RAM.
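A back-of-the-envelope sketch of why the ~2 GB figure is plausible, assuming fp16 weights and ignoring activation and framework overhead (the helper name is illustrative, not from the model's documentation):

```python
# Approximate memory for model weights alone: params x bytes-per-param.
def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Estimate weight memory in GiB (default: 2 bytes/param for fp16)."""
    return n_params * bytes_per_param / 1024**3

# 600M parameters at fp16 -> roughly 1.12 GiB of weights, leaving headroom within 2 GB.
print(round(weight_memory_gb(600e6), 2))

# Loading with NVIDIA NeMo (not executed here; requires `pip install nemo_toolkit[asr]`):
# import nemo.collections.asr as nemo_asr
# model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
# print(model.transcribe(["sample.wav"])[0].text)
```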