NVIDIA Parakeet TDT 0.6B v3 is a 600M-parameter multilingual ASR model supporting 25 European languages with automatic language detection, offering the highest throughput among multilingual models on the Hugging Face Open ASR leaderboard.
Parakeet-tdt-0.6b-v3 is a 600-million-parameter multilingual automatic speech recognition (ASR) model designed for high-throughput speech-to-text transcription. It extends parakeet-tdt-0.6b-v2 by expanding language support from English only to 25 European languages, and it detects the spoken language automatically, with no language prompt required. Among multilingual models on the Hugging Face Open ASR leaderboard, it has the highest throughput.
Architecture: FastConformer encoder with a Token-and-Duration Transducer (TDT) decoder. The model uses a unified SentencePiece tokenizer with an 8,192-token vocabulary optimized across all 25 languages. It supports long-audio transcription: up to 24 minutes with full attention (on an A100 80GB GPU) or up to 3 hours with local attention.
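The gap between the 24-minute and 3-hour limits follows from how attention cost grows with audio length. The sketch below is a back-of-the-envelope estimate, not a memory model of the real network: it counts attention score entries per head per layer only. The 80 ms encoder frame duration and the 256-frame left/right local window are assumptions chosen for illustration, and the function names are hypothetical.

```python
FRAME_MS = 80  # assumed duration of one encoder output frame, in milliseconds


def encoder_frames(audio_seconds: int) -> int:
    """Number of encoder frames for the given audio length (rounded up)."""
    audio_ms = audio_seconds * 1000
    return -(-audio_ms // FRAME_MS)  # integer ceiling division


def full_attention_scores(frames: int) -> int:
    """Full attention: every frame attends to every frame (quadratic)."""
    return frames * frames


def local_attention_scores(frames: int, left: int, right: int) -> int:
    """Local attention: each frame attends to at most left + right + 1 frames (linear)."""
    window = min(frames, left + right + 1)
    return frames * window


frames_24_min = encoder_frames(24 * 60)
frames_3_hours = encoder_frames(3 * 60 * 60)

print("24 min, full attention :", full_attention_scores(frames_24_min))
print("3 hours, full attention:", full_attention_scores(frames_3_hours))
print("3 hours, local attention:", local_attention_scores(frames_3_hours, 256, 256))
```

Under these assumptions, three hours of audio with a local window produces fewer score entries than 24 minutes with full attention, which is consistent with the longer limit quoted for local attention.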
Training: The model was initialized from a multilingual CTC checkpoint pretrained on the Granary dataset, then trained for 150,000 steps on 128 A100 GPUs, using temperature sampling (temperature 0.5) to balance languages. A second stage (Stage 2) then fine-tuned the model for 5,000 steps on 4 A100 GPUs using ~7,500 hours of human-transcribed data from NeMo ASR Set 3.0.
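To illustrate what temperature sampling does for language balancing, here is a minimal sketch. It assumes the common convention in which each language is sampled with probability proportional to its share of the data raised to the temperature; the exact formula used in training is not stated here, so treat this as an assumption. The hour counts are invented for illustration and the function name is hypothetical.

```python
from typing import Dict


def temperature_sampling_probs(hours: Dict[str, float], temperature: float = 0.5) -> Dict[str, float]:
    """Per-language sampling probabilities proportional to (share of data) ** temperature."""
    total = sum(hours.values())
    weights = {lang: (h / total) ** temperature for lang, h in hours.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical hour counts, for illustration only (not the real corpus sizes).
example_hours = {"en": 8100.0, "de": 900.0, "mt": 100.0}
probs = temperature_sampling_probs(example_hours, temperature=0.5)

total_hours = sum(example_hours.values())
for lang, p in probs.items():
    natural = example_hours[lang] / total_hours
    print(f"{lang}: natural share={natural:.3f}, sampling probability={p:.3f}")
```

A temperature of 1.0 would reproduce the natural data shares, and a temperature near 0 would approach uniform sampling. At 0.5, low-resource languages are sampled more often than their raw share while high-resource languages still dominate.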
Languages: English, Spanish, French, German, Italian, Portuguese, Dutch, Russian, Polish, Ukrainian, Slovak, Bulgarian, Finnish, Romanian, Croatian, Czech, Swedish, Estonian, Hungarian, Lithuanian, Danish, Maltese, Slovenian, Latvian, Greek (25 total).
Use cases: Multilingual transcription services, voice assistants, subtitle generation, conversational AI, voice analytics platforms, and on-device multilingual ASR. The model provides automatic punctuation, capitalization, and word-level timestamps.
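As an example of the subtitle use case, word-level timestamps can be grouped into SRT cues. The sketch below assumes each word arrives as a dictionary with "word", "start", and "end" keys, with times in seconds; that input shape, the helper names, and the fixed words-per-cue grouping are assumptions for illustration, not the model's documented output format.

```python
from typing import Dict, List


def format_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: List[Dict], max_words: int = 7) -> str:
    """Group consecutive words into numbered SRT cues of at most max_words words."""
    cues = []
    for index, first in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[first:first + max_words]
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hypothetical word-level timestamps (seconds), for illustration only.
example_words = [
    {"word": "Hello,", "start": 0.0, "end": 0.4},
    {"word": "world.", "start": 0.48, "end": 0.96},
]
print(words_to_srt(example_words))
```

A production subtitle pipeline would usually also break cues at punctuation and pauses rather than at a fixed word count; the fixed grouping here just keeps the sketch short.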