NVIDIA Parakeet TDT 0.6B v3 is a 600M-parameter multilingual ASR model supporting 25 European languages with automatic language detection, offering the highest throughput among multilingual models on the Hugging Face Open ASR leaderboard.
A frontier-grade 0.6B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
parakeet-tdt-0.6b-v3 is a 600-million-parameter multilingual ASR model designed for high-throughput speech-to-text transcription. It extends parakeet-tdt-0.6b-v2 by expanding language support from English only to 25 European languages, and it detects the language of the audio automatically, with no language prompt required. Among multilingual models on the Hugging Face Open ASR leaderboard, it has the highest throughput.
Architecture: FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder. The model uses a unified SentencePiece tokenizer with an 8,192-token vocabulary optimized across all 25 languages. It supports long-audio transcription: up to 24 minutes with full attention (on an A100 80GB GPU) or up to 3 hours with local attention.
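The two long-audio limits above suggest a simple routing rule when preparing a batch of recordings. The following is a minimal sketch, not part of any NVIDIA tooling: the function name `choose_attention_mode` and the `"chunk"` fallback are hypothetical, and the thresholds are just the 24-minute and 3-hour figures quoted above. The 24-minute figure is tied to an A100 80GB and would likely be lower on smaller GPUs.

```python
# Limits quoted in the architecture notes above.
FULL_ATTENTION_MAX_SECONDS = 24 * 60        # 24 minutes (A100 80GB)
LOCAL_ATTENTION_MAX_SECONDS = 3 * 60 * 60   # 3 hours


def choose_attention_mode(duration_seconds: float) -> str:
    """Pick an attention mode for a recording of the given length.

    Returns "full", "local", or "chunk". The "chunk" value is a hypothetical
    fallback meaning the audio exceeds both limits and should be split first.
    """
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    if duration_seconds <= FULL_ATTENTION_MAX_SECONDS:
        return "full"
    if duration_seconds <= LOCAL_ATTENTION_MAX_SECONDS:
        return "local"
    return "chunk"


# Example: a 10-minute call, a 1-hour lecture, and a 4-hour recording.
for seconds in (10 * 60, 60 * 60, 4 * 60 * 60):
    print(seconds, "->", choose_attention_mode(seconds))
```

How the chosen mode is actually applied depends on the inference toolkit in use; this sketch only covers the decision.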
Training: Initialized from a multilingual CTC checkpoint pretrained on the Granary dataset. Stage 1 trained for 150,000 steps on 128 A100 GPUs, using temperature sampling (temperature 0.5) to balance languages. Stage 2 fine-tuned for 5,000 steps on 4 A100 GPUs on ~7,500 hours of human-transcribed data from NeMo ASR Set 3.0.
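To illustrate what temperature sampling with a value of 0.5 does for language balancing, here is a minimal sketch. It assumes the common formulation in which each language is sampled with probability proportional to its share of the data raised to the temperature; the source does not spell out the exact formula used in training. The function name and the per-language hours are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def temperature_sampling_weights(hours_per_language, temperature=0.5):
    """Return sampling probabilities proportional to (share of data) ** temperature.

    temperature=1.0 reproduces the natural data distribution; values below 1.0
    flatten it, so low-resource languages are drawn more often.
    """
    total = sum(hours_per_language.values())
    if total <= 0:
        raise ValueError("total hours must be positive")
    scaled = {
        lang: (hours / total) ** temperature
        for lang, hours in hours_per_language.items()
    }
    norm = sum(scaled.values())
    return {lang: weight / norm for lang, weight in scaled.items()}


# Hypothetical hours per language, not taken from the actual training set.
example_hours = {"en": 8100.0, "de": 1600.0, "mt": 100.0}
weights = temperature_sampling_weights(example_hours, temperature=0.5)

total_hours = sum(example_hours.values())
for lang, p in weights.items():
    natural = example_hours[lang] / total_hours
    print(f"{lang}: natural share {natural:.3f}, sampling probability {p:.3f}")
```

With a temperature of 0.5 the weights follow the square roots of the hours (90 : 40 : 10 here), so the smallest language in this toy example rises from about 1% of the data to about 7% of the samples.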
Languages: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Polish, Ukrainian, Slovak, Bulgarian, Finnish, Romanian, Croatian, Czech, Swedish, Estonian, Hungarian, Lithuanian, Danish, Maltese, Slovenian, Latvian, Greek (25 total).