
NVIDIA Parakeet TDT 1.1B is an XXL FastConformer Token-and-Duration Transducer English ASR model, offering higher accuracy and 64% greater speed than the comparable Parakeet RNNT 1.1B.
A solid 1.1B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
NVIDIA Parakeet TDT 1.1B is an English automatic speech recognition (ASR) model that transcribes spoken audio into lowercase English text. Developed jointly by NVIDIA NeMo and Suno.ai, it uses a FastConformer architecture paired with a Token-and-Duration Transducer (TDT) decoder. At 1.1 billion parameters, it represents the XXL variant in the Parakeet family—designed for applications where transcription accuracy and low latency both matter.
The defining claim for this model is straightforward: NVIDIA states it delivers higher accuracy than the comparable Parakeet RNNT 1.1B while running 64% faster. That speed advantage comes from the TDT architecture, which decouples token prediction from duration prediction, enabling more efficient inference. For practitioners running ASR locally, this translates to real-time or faster-than-real-time transcription on consumer hardware without cloud dependencies.
Parakeet TDT 1.1B occupies the high-accuracy tier of NVIDIA’s open ASR lineup, competing with similarly sized models like Whisper large-v3 and other 1B-class transducers. It is released under the permissive CC-BY-4.0 license, meaning you can deploy, modify, and redistribute it, including commercially, as long as you give attribution.
The model is built on a FastConformer encoder, an optimized variant of the Conformer architecture that reduces computation while preserving the ability to capture both local and global context in audio. The encoder processes 16 kHz mono audio into frame-level representations.
The Token-and-Duration Transducer (TDT) decoder differs from the standard Recurrent Neural Network Transducer (RNNT) in how it handles output timing. An RNNT decoder advances through the encoder output one frame at a time and emits a blank for every frame that produces no token, so decoding cost grows with the number of frames. TDT predicts two things at each step: which token to emit, and a duration that tells the decoder how many encoder frames that prediction covers. The decoder then jumps ahead by that duration instead of stepping frame by frame, skipping unnecessary computation. This is the source of the 64% speed improvement claimed over the RNNT variant.
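The difference in decoding cost can be illustrated with a toy greedy loop. This is a minimal sketch and not NeMo's implementation: `toy_joint` is a hypothetical stand-in for the joint network, and the scripted alignment, frame count, and maximum duration are made-up values chosen only to show the frame-skipping effect.

```python
BLANK = None          # stand-in for the transducer blank symbol
NUM_FRAMES = 10       # hypothetical number of encoder frames
MAX_DURATION = 4      # hypothetical largest duration the toy model predicts

# Made-up alignment: (start_frame, token, duration_in_frames)
SCRIPT = [(0, "hel", 4), (4, "lo", 3)]


def toy_joint(t, num_emitted):
    """Hypothetical joint network: returns (token, duration) for frame t."""
    if num_emitted < len(SCRIPT):
        start, token, duration = SCRIPT[num_emitted]
        if t >= start:
            return token, duration
        gap = start - t              # frames until the next token begins
    else:
        gap = NUM_FRAMES - t         # only silence remains
    # Blank durations are always at least 1 here so decoding makes progress.
    return BLANK, max(1, min(MAX_DURATION, gap))


def rnnt_greedy(num_frames, joint):
    """RNNT-style loop: ignores durations, moves one frame per blank."""
    t, tokens, calls = 0, [], 0
    while t < num_frames:
        token, _ = joint(t, len(tokens))
        calls += 1
        if token is BLANK:
            t += 1
        else:
            tokens.append(token)
    return tokens, calls


def tdt_greedy(num_frames, joint):
    """TDT-style loop: jumps ahead by the predicted duration."""
    t, tokens, calls = 0, [], 0
    while t < num_frames:
        token, duration = joint(t, len(tokens))
        calls += 1
        if token is not BLANK:
            tokens.append(token)
        t += duration
    return tokens, calls


rnnt_tokens, rnnt_calls = rnnt_greedy(NUM_FRAMES, toy_joint)
tdt_tokens, tdt_calls = tdt_greedy(NUM_FRAMES, toy_joint)
print("RNNT-style:", rnnt_tokens, "joint calls:", rnnt_calls)
print("TDT-style: ", tdt_tokens, "joint calls:", tdt_calls)
```

Both loops recover the same tokens, but the TDT-style loop calls the joint network far fewer times because it never visits the skipped frames. The size of the saving in this toy depends entirely on the made-up alignment and says nothing about real-world speedups.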
Key architectural specs:
The model uses a subword tokenizer (BPE) trained on its training corpus, which includes LibriSpeech, Fisher, Switchboard, WSJ, VoxPopuli, Common Voice, and others. This diverse training set means the model handles both read speech and spontaneous conversational speech.
Context length is not officially specified, but in practice FastConformer models process audio in fixed-length windows. For long-form audio (meetings, podcasts), the model handles segmentation internally or you can chunk the input.
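If you chunk the input yourself, the chunk boundaries can be computed ahead of time. The sketch below is one simple way to do it, assuming fixed-length chunks with an optional overlap; the function name `chunk_spans` and the 30-second and 5-second values in the usage line are hypothetical choices, not settings recommended by NVIDIA. Merging the transcripts of overlapping chunks is a separate problem that this sketch does not address.

```python
def chunk_spans(total_seconds, chunk_seconds, overlap_seconds=0.0):
    """Return (start, end) times in seconds covering the whole recording."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    spans = []
    step = chunk_seconds - overlap_seconds  # distance between chunk starts
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:            # last chunk reached the end
            break
        start += step
    return spans


# Usage: a 100-second recording split into 30-second chunks, 5 seconds overlap.
for start, end in chunk_spans(100, 30, overlap_seconds=5):
    print(f"{start:6.1f}s -> {end:6.1f}s")
```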
Parakeet TDT 1.1B is an English-only transcription model. It does not support speaker diarization, punctuation, or capitalization in its base form—those features require a separate post-processing step or a model variant like Parakeet-unified.
Published word error rates (WER) from the model card demonstrate its accuracy across diverse domains:
| Dataset | WER |
|---|---|
| LibriSpeech (clean) | 1.39% |
| LibriSpeech (other) | 2.62% |
| GigaSpeech | 9.55% |
| Earnings-22 | 14.65% |
| AMI (meetings) | 15.90% |
| TED-LIUM v3 | 3.56% |
| SPGI Speech | 3.42% |
| Vox Populi | 6.99% |
The model performs best on clean read speech (LibriSpeech) and remains competitive on financial earnings calls, TED talks, and meeting scenarios. The higher WER on AMI (15.9%) is typical for far-field meeting transcription and represents a known challenge for all ASR systems.
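For readers who want to measure WER on their own recordings, the metric is the word-level edit distance between reference and hypothesis divided by the number of reference words. The function below is a minimal sketch of that standard definition; it assumes both strings are already normalized (lowercase, no punctuation), and it is not the evaluation script used to produce the published numbers.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Usage: one dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```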
Concrete use cases where this model fits:
If you need punctuation, capitalization, or streaming with low (160ms) latency, check the Parakeet-unified-en-0.6b model instead, which trades some parameter count for those features.
At FP16 precision, the model occupies approximately 2.2 GB of VRAM for the weights alone. Inference requires additional memory for activations and intermediate tensors. Realistic VRAM requirements:
The model can also run on CPU with lower throughput. CPU-only or CPU-offloaded inference works for small batches of short audio but is not recommended for real-time use.
Because the model uses a dense architecture (not MoE), quantization directly reduces memory and accelerates inference. Recommended approaches:
Note: The model is natively distributed in NeMo format, not GGUF. A GGUF build would require converting the checkpoint yourself, and general-purpose converters such as llama.cpp’s conversion tools target language models and may not support this architecture. Most practitioners will use FP16 via NeMo and let the framework handle optimization.
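The weight-memory figure quoted for FP16 follows from parameter count times bytes per parameter, and the same arithmetic shows what lower precision would save. This is a back-of-the-envelope sketch that covers weights only; activations, intermediate tensors, and framework overhead come on top and are not estimated here.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params, precision):
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


# Usage: the 1.1B-parameter model at each precision.
for precision in ("fp32", "fp16", "int8"):
    print(f"{precision}: {weight_memory_gb(1.1e9, precision):.1f} GB")
```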
On an RTX 4090 with FP16 and batch size 1, the model transcribes short audio clips at roughly 5–10x real-time (a 30-second audio clip processes in 3–6 seconds). Throughput scales with batch size: batch of 8 on the same GPU yields around 80–120 seconds of audio per second of wall time.
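The real-time figures translate into wall-clock time and rental cost with simple division. The helpers below are a rough sketch; the function names are hypothetical, and the inputs in the usage lines are the estimates quoted on this page rather than measured benchmarks.

```python
def processing_seconds(audio_seconds, speed_factor):
    """Wall-clock time to transcribe audio at a given multiple of real time."""
    return audio_seconds / speed_factor


def cost_per_audio_hour(price_per_gpu_hour, speed_factor):
    """GPU rental cost to transcribe one hour of audio."""
    return price_per_gpu_hour / speed_factor


# Usage: a 30-second clip at 5x and 10x real time.
print(processing_seconds(30, 5), processing_seconds(30, 10))

# Usage: a $0.13/hour GPU processing 80 seconds of audio per second.
print(f"${cost_per_audio_hour(0.13, 80):.4f} per audio hour")
```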
On M2 Max (96 GB unified memory), expect similar real-time factors. On RTX 3060 (12 GB), performance dips to 3–5x real-time for single clips.
The model runs via NVIDIA NeMo. Installation and one-shot inference:
```shell
pip install nemo_toolkit['all']
```

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained("nvidia/parakeet-tdt-1.1b")
transcription = model.transcribe(["audio_file.wav"])
print(transcription[0].text)
```
NeMo handles audio downsampling to 16kHz automatically. If you run into VRAM limits, reduce the batch size or convert to INT8.
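If you would rather confirm the input format yourself before handing files to NeMo, the sample rate and channel count of a WAV file can be read with Python's standard `wave` module. This is an optional sketch; the helper names `wav_format` and `is_16k_mono` are hypothetical, and it only handles uncompressed PCM WAV files.

```python
import wave


def wav_format(path):
    """Return (sample_rate_hz, num_channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getnchannels()


def is_16k_mono(path):
    """True if the file is already 16 kHz mono."""
    return wav_format(path) == (16000, 1)
```

A typical call would be `is_16k_mono("audio_file.wav")`, using the same file name as the inference example.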
The closest comparison is the RNNT sibling. Both are 1.1B-parameter FastConformer models trained on the same dataset. The TDT version achieves equal or better WER while running 64% faster during decoding (NVIDIA’s published figure). If you are choosing between the two, go with TDT unless you have a specific reason to use the RNNT decoder (e.g., a custom pipeline that depends on RNNT internals). There is no accuracy tradeoff.
Whisper large-v3 has 40% more parameters and supports multilingual transcription plus punctuation and casing out of the box. On English benchmarks, Parakeet TDT 1.1B matches or beats Whisper large-v3’s WER on LibriSpeech clean (1.39% vs. ~1.5%) but trails on noisy or accented speech. Whisper is a better choice if you need multilingual support, punctuation, or a model that works in a wider range of acoustic conditions. Parakeet TDT wins on speed and parameter efficiency—it runs faster on the same hardware and requires less VRAM.
