NVIDIA Parakeet RNNT 1.1B is an XXL FastConformer RNN-Transducer English ASR model jointly developed by NVIDIA NeMo and Suno.ai, offering strong accuracy and streaming-capable inference.
Parakeet-RNNT-1.1B is an ASR model that transcribes speech into lower-case English text. Jointly developed by NVIDIA NeMo and Suno.ai, it is an XXL variant of the FastConformer Transducer (~1.1B parameters). At release in early 2024, it (along with Parakeet CTC) topped the Hugging Face Open ASR Leaderboard, surpassing Whisper.
Architecture: FastConformer encoder (an optimized Conformer with 8x depthwise-separable convolutional downsampling) with an RNN-Transducer (RNNT) decoder trained with transducer loss in a multitask setup. Supports streaming inference.
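The RNNT decoder emits zero or more labels per encoder frame and advances in time only on a blank emission, which is what makes frame-synchronous streaming possible. A toy greedy-decoding sketch in pure Python (illustrative only: `toy_joint` is a hypothetical integer stub standing in for the trained prediction + joint network, which in the real model consumes FastConformer encoder states):

```python
# Toy sketch of greedy RNN-Transducer (RNNT) decoding. Illustrative only:
# in the actual model, `joint` is a trained prediction + joint network over
# FastConformer encoder outputs, not this integer stub.
BLANK = 0  # blank token id

def greedy_rnnt_decode(encoder_frames, joint, max_symbols_per_frame=10):
    """Emit labels frame by frame; a blank emission advances time."""
    hypothesis = []
    last_label = BLANK
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):  # cap emissions per frame
            label = joint(frame, last_label)
            if label == BLANK:
                break  # blank: move on to the next encoder frame
            hypothesis.append(label)
            last_label = label
    return hypothesis

def toy_joint(frame, last_label):
    # Hypothetical stub: emit the frame's label once, otherwise blank.
    return frame if frame not in (BLANK, last_label) else BLANK

print(greedy_rnnt_decode([1, 0, 2, 2], toy_joint))  # [1, 2]
```

The per-frame emission cap is a standard safeguard in greedy transducer decoding; it bounds output length without changing results on well-behaved inputs.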
Training: Trained using the NVIDIA NeMo toolkit for several hundred epochs on a large multi-domain English corpus (LibriSpeech, Fisher, Switchboard, WSJ-0/1, Common Voice 8.0, National Singapore Corpus 1 & 6, VCTK, VoxPopuli, Europarl, Multilingual LibriSpeech, People's Speech) plus proprietary data.
Use cases: Streaming English ASR, voice assistants, call-center transcription, captioning, and as a base for fine-tuning. Accepts 16 kHz mono-channel audio (WAV) as input.
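Since the model expects 16 kHz mono 16-bit PCM WAV input, a minimal stdlib-only sketch that writes a conforming test file (the filename, tone frequency, and duration are placeholders, not anything the model prescribes):

```python
# Sketch: produce audio in the format the model expects -- 16 kHz,
# mono, 16-bit PCM WAV. Filename and tone parameters are illustrative.
import math
import struct
import wave

SAMPLE_RATE = 16_000  # Hz, as required by the model
DURATION_S = 1.0

def write_test_wav(path="input_16k_mono.wav", freq_hz=440.0):
    n_samples = int(SAMPLE_RATE * DURATION_S)
    frames = b"".join(
        struct.pack(
            "<h",
            int(32767 * 0.2 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE)),
        )
        for i in range(n_samples)
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)             # mono
        wf.setsampwidth(2)             # 16-bit PCM
        wf.setframerate(SAMPLE_RATE)   # 16 kHz
        wf.writeframes(frames)
    return path

path = write_test_wav()
with wave.open(path, "rb") as wf:
    print(wf.getnchannels(), wf.getframerate())  # 1 16000
```

Real recordings at other sample rates or channel counts would need resampling/downmixing (e.g. via ffmpeg or a resampling library) before transcription.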