
NVIDIA Parakeet RNNT 1.1B is an XXL FastConformer RNN-Transducer English ASR model jointly developed by NVIDIA NeMo and Suno.ai, offering strong accuracy and streaming-capable inference.
A solid 1.1B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
NVIDIA Parakeet RNNT 1.1B is a production-grade English automatic speech recognition (ASR) model developed jointly by NVIDIA NeMo and Suno.ai. It is an XXL variant of the FastConformer Transducer architecture, packing 1.1B dense parameters. The model is designed for developers who need accurate, streaming-capable speech-to-text inference on their own hardware — not through a cloud API.
This model sits at the top of the Open ASR Leaderboard on Hugging Face, beating OpenAI Whisper large-v3 on average Word Error Rate (WER) across a wide range of benchmarks. Its licensing under CC-BY-4.0 makes it suitable for both research and commercial applications. If you’re building a local voice interface, transcription pipeline, or real-time captioning system, Parakeet RNNT 1.1B is a strong candidate.
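Since the comparison above rests on Word Error Rate, it may help to see how the metric is computed. The following is a minimal sketch, assuming plain whitespace tokenization and lower-casing; the function name `word_error_rate` is illustrative, and public leaderboards typically apply their own text normalization before scoring, so numbers from this sketch will not match theirs exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

A WER of 7% therefore means roughly seven word-level edits (substitutions, deletions, or insertions) per hundred reference words.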
Parakeet RNNT 1.1B uses the FastConformer encoder paired with an RNN-Transducer (RNNT) decoder. The architecture is dense — all 1.1B parameters are active during inference. This means VRAM consumption scales predictably with the model size, without the variable active-parameter count of Mixture-of-Experts models.
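Because every parameter is active, a rough lower bound on VRAM follows directly from the parameter count. The sketch below covers weights only; activations, decoder state, and framework overhead come on top and are not modeled here. The helper name `weight_memory_gb` and the byte sizes per data type are illustrative assumptions, not figures from the model card.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


PARAKEET_PARAMS = 1.1e9  # parameter count stated in the text above

for dtype in ("fp32", "fp16"):
    print(f"{dtype}: ~{weight_memory_gb(PARAKEET_PARAMS, dtype):.1f} GB of weights")
```

Under these assumptions the weights take about 4.4 GB in FP32 and about 2.2 GB in FP16.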
The FastConformer encoder is optimized for both offline and streaming inference. It processes 16 kHz mono audio, outputting lower-case English text. The model supports chunked streaming, enabling sub‑second latency on live audio streams via the NeMo toolkit or NVIDIA Riva NIM microservices.
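Since the model expects 16 kHz mono input, a quick header check before transcription can catch mismatched files early. This is a small sketch using only Python's standard `wave` module; the function name `check_wav_format` is hypothetical, and it only inspects uncompressed WAV files rather than resampling them.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16_000) -> list[str]:
    """Return a list of problems; an empty list means the header looks usable."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        if rate != expected_rate:
            problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
        if channels != 1:
            problems.append(f"{channels} channels, expected mono")
    return problems
```

Files that fail the check would need to be resampled or downmixed with an audio tool of your choice before being passed to the model.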
Context length is not explicitly specified, but the model can handle arbitrarily long audio sequences by processing in segments. In practice, it can transcribe entire meeting recordings (e.g., AMI test set) with a WER of 17.1% — a demanding scenario that requires robust acoustic modeling.
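Segment-wise processing of a long recording can be sketched as follows. The helper `segment_bounds` is hypothetical, and the 30-second chunk and 2-second overlap are illustrative values rather than settings from NeMo or the model card. The sketch only computes the time windows; stitching the overlapping transcripts back together is left out.

```python
def segment_bounds(total_s: float, chunk_s: float = 30.0, overlap_s: float = 2.0):
    """Return (start, end) times in seconds that cover [0, total_s] with overlap."""
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    step = chunk_s - overlap_s  # how far each new window advances
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:
            break  # the last window already reaches the end of the audio
        start += step
    return bounds


# A 70-second recording split into overlapping 30-second windows.
for start, end in segment_bounds(70.0):
    print(f"{start:6.1f} s -> {end:6.1f} s")
```

The overlap gives each window some shared audio with its neighbor, so words that straddle a boundary appear whole in at least one segment.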
Parakeet RNNT 1.1B is a pure speech-to-text model. It excels in these areas:
Concrete use cases:
This model runs well on consumer hardware. Because it is dense and uses a transducer decoder, you don’t need a datacenter GPU.
| Hardware | VRAM | Feasibility |
|---|---|---|
| NVIDIA RTX 3060 12GB | 12 GB | Excellent |
| NVIDIA RTX 4070 12GB | 12 GB | Excellent |
| NVIDIA RTX 4090 24GB | 24 GB | Overkill but ideal for multi-stream |
| Apple M4 Max (64 GB unified) | Shared | Runs via NeMo with Metal backend |
| NVIDIA RTX 2060 6GB | 6 GB | Single stream possible (FP16) |
The quickest path is using the NVIDIA Riva NIM container or the NeMo Python library:
    pip install "nemo_toolkit[all]"
Then load the model:
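    import nemo.collections.asr as nemo_asr

    # Download (on first use) and load the pretrained checkpoint.
    model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/parakeet-rnnt-1.1b")

    # Transcribe a list of 16 kHz mono audio files.
    transcript = model.transcribe(["audio.wav"])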
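The shape of the value returned by `transcribe` may differ between NeMo releases, so a small normalizing helper can make downstream code more robust. The sketch below rests on two assumptions that should be checked against your installed version: that some releases return a tuple whose first element holds the best hypotheses, and that some return objects exposing a `.text` attribute instead of plain strings. The helper name `to_text` is hypothetical.

```python
def to_text(result) -> list[str]:
    """Flatten a transcribe() result into a list of plain strings."""
    # Assumption: some releases return (best_hypotheses, all_hypotheses).
    if isinstance(result, tuple):
        result = result[0]
    texts = []
    for item in result:
        # Assumption: some releases return objects with a `.text` attribute.
        texts.append(item if isinstance(item, str) else item.text)
    return texts
```

With the loading code above, usage would look like `texts = to_text(model.transcribe(["audio.wav"]))`.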
For streaming inference, deploy the NIM container (see NVIDIA docs for details). Expect real-time performance — the model transcribes faster than real time on most modern GPUs.
The primary competitor at this scale is OpenAI Whisper large-v3 (1.5B parameters). Both are dense models designed for English ASR. Key differences:
| Aspect | Parakeet RNNT 1.1B | Whisper large-v3 |
|---|---|---|
| Architecture | FastConformer + RNNT | Encoder-decoder transformer |
| Streaming | Native (RNNT decoder) | Requires windowing tricks |
| Average WER (Open ASR Leaderboard) | 7.04% | 7.7% |
| Real-Time Factor (RTF) | 14.4 × 10⁻³ | 7.45 × 10⁻³ (faster raw throughput) |
| Multilingual | English only | 99 languages |
| Output format | Lower‑case English only | Punctuation, capitalization, timestamps |
| License | CC-BY-4.0 | MIT (for weights) |
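To make the table's RTF and WER figures more concrete, the arithmetic below converts them into processing time per hour of audio and a relative WER difference. It assumes RTF is defined as processing time divided by audio duration (lower is faster), which matches the table's note that the lower value means faster throughput. The function names are illustrative, and actual timings depend on the hardware used for the benchmark.

```python
def processing_seconds(audio_seconds: float, rtf: float) -> float:
    """RTF = processing time / audio duration, so time = duration * RTF."""
    return audio_seconds * rtf


def relative_reduction(baseline: float, candidate: float) -> float:
    """Fractional improvement of candidate over baseline (lower is better)."""
    return (baseline - candidate) / baseline


ONE_HOUR = 3600.0  # seconds of audio

# RTF and WER values taken from the comparison table above.
print(f"Parakeet: {processing_seconds(ONE_HOUR, 14.4e-3):.1f} s per hour of audio")
print(f"Whisper:  {processing_seconds(ONE_HOUR, 7.45e-3):.1f} s per hour of audio")
print(f"Relative WER reduction: {relative_reduction(7.7, 7.04):.1%}")
```

Under that definition, Parakeet needs roughly 52 seconds and Whisper roughly 27 seconds per hour of audio, while Parakeet's average WER is about 8.6% lower in relative terms.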
When to choose Parakeet RNNT 1.1B: You need a streaming, low-latency ASR for English-only applications and can trade punctuation/capitalization for higher accuracy in noisy environments. It also has a slightly smaller footprint (1.1B vs 1.5B), which can matter on constrained hardware.
When to choose Whisper large-v3: You need multilingual support, or you require punctuation and casing in the raw output. Whisper is also faster in raw RTF, and Parakeet’s WER advantage narrows on clean speech.
Both models are viable for local deployment. Parakeet RNNT 1.1B is the better choice if streaming accuracy is your primary goal and you are willing to use a separate punctuation/casing model (e.g., via NeMo) if needed.
