NVIDIA Parakeet TDT 1.1B

NVIDIA Parakeet TDT 1.1B is an XXL FastConformer Token-and-Duration Transducer English ASR model, offering higher accuracy and 64% greater speed than the comparable Parakeet RNNT 1.1B.

1.1B params · Dense

Our Take

Best for: Open-source ASR workloads

A solid 1.1B-parameter dense audio model from NVIDIA. Treat the modality benchmarks on this page as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 1.1B
Architecture: Dense
Provider: NVIDIA
Download Size: 4.3 GB

Community

Monthly Downloads: 18.6K
Likes: 125
Last Updated: 6 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER: 7.0%

MBA Open Score: 66.1 (BB)

Component	Weight	Score
Benchmark	40%	86.0
Popularity	25%	51.1
Efficiency	25%	47.8
Versatility	10%	70.0
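
The composite score appears to be a weighted sum of the four component scores. The sketch below recomputes it from the weights and scores listed on this page; the formula itself is an assumption, since the page does not state how the components are combined.

```python
# Weights and component scores copied from this page.
# Assumption: the composite is a plain weighted sum of the components.
COMPONENTS = {
    "Benchmark": (0.40, 86.0),
    "Popularity": (0.25, 51.1),
    "Efficiency": (0.25, 47.8),
    "Versatility": (0.10, 70.0),
}


def composite_score(components):
    """Return the weighted sum of component scores; weights must sum to 1."""
    total_weight = sum(weight for weight, _ in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(weight * score for weight, score in components.values())


print(f"{composite_score(COMPONENTS):.3f}")
```

Under that assumption the result is about 66.1, which agrees with the listed MBA Open Score.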

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.2 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.2 GB
AMD Instinct MI300X	AMD	SS	1.2 GB
AMD Instinct MI325X	AMD	SS	1.2 GB
AMD Instinct MI355X	AMD	SS	1.2 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.2 GB
AMD Radeon RX 7700 XT	AMD	SS	1.2 GB
AMD Radeon RX 7800 XT	AMD	SS	1.2 GB
AMD Radeon RX 7900 XT	AMD	SS	1.2 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.2 GB
AMD Radeon RX 9070	AMD	SS	1.2 GB
AMD Radeon RX 9070 XT	AMD	SS	1.2 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.2 GB
Apple M4	Apple	SS	1.2 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.2 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.2 GB
Apple M5	Apple	SS	1.2 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.2 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.2 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.2 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.2 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.2 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.2 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.2 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.2 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4 · Vast.ai · Spot · 24 GB VRAM	$0.03
NVIDIA L4 · Vast.ai · On-Demand · 24 GB VRAM	$0.04
NVIDIA GeForce RTX 5060 Ti · Vast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 Ti · Vast.ai · On-Demand · 16 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 Ti · Vast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
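
To compare spot and on-demand prices on equal terms, it can help to convert the hourly rate into a cost per hour of transcribed audio. The sketch below is a rough estimate only: the `speedup` value (how many times faster than real time the GPU transcribes) and the `restart_overhead` fraction are hypothetical inputs that you would need to measure on your own workload.

```python
def cost_per_audio_hour(price_per_gpu_hour, speedup, restart_overhead=0.0):
    """Estimate the rental cost of transcribing one hour of audio.

    price_per_gpu_hour: listed rental price in dollars.
    speedup: assumed processing speed as a multiple of real time.
    restart_overhead: assumed fraction of extra GPU time lost to spot
        interruptions (0.2 means 20% more GPU time than an uninterrupted run).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if restart_overhead < 0:
        raise ValueError("restart_overhead must not be negative")
    gpu_hours = (1.0 / speedup) * (1.0 + restart_overhead)
    return price_per_gpu_hour * gpu_hours


# Prices from the table above; the 5x speedup and 20% overhead are placeholders.
spot = cost_per_audio_hour(0.03, speedup=5.0, restart_overhead=0.2)
on_demand = cost_per_audio_hour(0.04, speedup=5.0)
print(f"spot: ${spot:.4f}/audio-hour, on-demand: ${on_demand:.4f}/audio-hour")
```

With these placeholder values the spot tier stays slightly cheaper even after the restart penalty, but the comparison flips once interruptions cost more than about a third of the run.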

About This Model

Overview

NVIDIA Parakeet TDT 1.1B is an English automatic speech recognition (ASR) model that transcribes spoken audio into lowercase English text. Developed jointly by NVIDIA NeMo and Suno.ai, it uses a FastConformer architecture paired with a Token-and-Duration Transducer (TDT) decoder. At 1.1 billion parameters, it represents the XXL variant in the Parakeet family—designed for applications where transcription accuracy and low latency both matter.

The defining claim for this model is straightforward: NVIDIA states it delivers higher accuracy than the comparable Parakeet RNNT 1.1B while running 64% faster. That speed advantage comes from the TDT architecture, which decouples token prediction from duration prediction, enabling more efficient inference. For practitioners running ASR locally, this translates to real-time or faster-than-real-time transcription on consumer hardware without cloud dependencies.

Parakeet TDT 1.1B occupies the high-accuracy tier of NVIDIA’s open ASR lineup, competing with similarly sized models like Whisper large-v3 and other 1B-class transducers. It is released under the permissive CC-BY-4.0 license, meaning you can deploy, modify, and redistribute it, including commercially, as long as you give attribution.

Architecture & Technical Details

The model is built on a FastConformer encoder, an optimized variant of the Conformer architecture that reduces computation while preserving the ability to capture both local and global context in audio. The encoder processes 16kHz mono audio into frame-level representations.

The Token-and-Duration Transducer (TDT) decoder differs from the standard Recurrent Neural Network Transducer (RNNT) in how it handles output timing. A conventional RNNT decoder steps through the encoder output one frame at a time, emitting a blank whenever a frame produces no token, so decoding cost grows with the number of frames. TDT adds a second prediction: alongside the token to emit, the model predicts a duration, meaning how many encoder frames that token covers. The decoder can then jump ahead by the predicted duration instead of visiting every frame, which is the source of the 64% speed improvement claimed over the RNNT variant.
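
The frame-skipping idea can be illustrated with a toy decoder. The sketch below is not NeMo's implementation: `make_scripted_joint` is a hypothetical stand-in for a trained model that simply replays a fixed alignment, and the token strings and frame counts are made up. It only shows why predicting durations reduces the number of decoder steps.

```python
BLANK = "<blank>"


def make_scripted_joint(script, num_frames):
    """Return a fake joint network that replays a fixed alignment.

    script: list of (start_frame, token, duration) tuples in time order,
    where each start_frame + duration does not exceed the next start_frame.
    The returned function maps (frame, tokens_emitted_so_far) to a
    (token, duration) pair, standing in for a trained model.
    """
    def joint(t, n_emitted):
        if n_emitted < len(script) and script[n_emitted][0] == t:
            _, token, duration = script[n_emitted]
            return token, duration
        # Nothing to emit here: predict blank and the gap to the next token.
        next_start = script[n_emitted][0] if n_emitted < len(script) else num_frames
        return BLANK, max(1, next_start - t)
    return joint


def greedy_frame_by_frame(joint, num_frames):
    """Conventional transducer decoding: a blank advances exactly one frame."""
    t, tokens, calls = 0, [], 0
    while t < num_frames:
        token, _ = joint(t, len(tokens))  # the duration output is ignored
        calls += 1
        if token == BLANK:
            t += 1
        else:
            tokens.append(token)
    return tokens, calls


def greedy_tdt(joint, num_frames):
    """TDT-style decoding: jump ahead by the predicted duration."""
    t, tokens, calls = 0, [], 0
    while t < num_frames:
        token, duration = joint(t, len(tokens))
        calls += 1
        if token != BLANK:
            tokens.append(token)
        t += duration
    return tokens, calls


script = [(0, "he", 2), (2, "llo", 3), (8, "world", 4)]
joint = make_scripted_joint(script, num_frames=12)
print(greedy_frame_by_frame(joint, 12))
print(greedy_tdt(joint, 12))
```

Both decoders recover the same three tokens, but on this 12-frame example the frame-by-frame loop calls the joint network 15 times while the duration-aware loop calls it 4 times. Real TDT models restrict durations to a small fixed set, so the savings are smaller than in this toy case.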

Key architectural specs:

Parameters: 1.1B (dense, all parameters active during inference)
Encoder: FastConformer XXL
Decoder: Token-and-Duration Transducer
Input: Single-channel 16kHz mono audio
Output: Lowercase English text (no punctuation, no capitalization)
Framework: NVIDIA NeMo
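
Since the expected input is 16kHz mono audio, it can be useful to check files before sending them through a pipeline. The sketch below uses only Python's standard `wave` module and handles uncompressed WAV files only; the function name `check_wav_format` is illustrative, not part of NeMo.

```python
import wave


def check_wav_format(path, expected_rate=16_000, expected_channels=1):
    """Return a list of mismatches between a WAV file and the expected format.

    An empty list means the file is already 16kHz mono (by default).
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()

    problems = []
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != expected_channels:
        problems.append(f"{channels} channel(s), expected {expected_channels}")
    return problems
```

Files that fail the check can be resampled or downmixed with a tool of your choice before transcription.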

The model uses a subword tokenizer (BPE) trained on its training corpus, which includes LibriSpeech, Fisher, Switchboard, WSJ, VoxPopuli, Common Voice, and others. This diverse training set means the model handles both read speech and spontaneous conversational speech.

Context length is not officially specified, but in practice FastConformer models process audio in fixed-length windows. For long-form audio (meetings, podcasts), the model handles segmentation internally or you can chunk the input.
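
If you choose to chunk long recordings yourself, one simple approach is fixed-length windows with a small overlap so that words at the boundaries are not cut off. The sketch below only computes the sample ranges; the 30-second window and 2-second overlap are arbitrary illustrative defaults, and merging the overlapping transcripts is left out.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=2.0):
    """Return (start, end) sample indices for overlapping fixed-length chunks."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if chunk <= 0 or step <= 0:
        raise ValueError("chunk_s must be positive and larger than overlap_s")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 70-second recording at 16kHz
for start, end in chunk_bounds(70 * 16_000):
    print(f"{start / 16_000:.1f}s to {end / 16_000:.1f}s")
```

Each range can be sliced out of the waveform and transcribed separately; the final chunk is shorter when the recording length is not a multiple of the step.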

Capabilities & Use Cases

Parakeet TDT 1.1B is an English-only transcription model. It does not support speaker diarization, punctuation, or capitalization in its base form—those features require a separate post-processing step or a model variant like Parakeet-unified.

Published word error rates (WER) from the model card demonstrate its accuracy across diverse domains:

Dataset	WER
LibriSpeech (clean)	1.39%
LibriSpeech (other)	2.62%
GigaSpeech	9.55%
Earnings-22	14.65%
AMI (meetings)	15.90%
TED-LIUM v3	3.56%
SPGI Speech	3.42%
Vox Populi	6.99%

The model performs best on clean read speech (LibriSpeech) and remains competitive on financial earnings calls, TED talks, and meeting scenarios. The higher WER on AMI (15.9%) is typical for far-field meeting transcription and represents a known challenge for all ASR systems.
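
For readers who want to reproduce this kind of number on their own data, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of words in the reference. The sketch below is a minimal implementation of that definition; it splits on whitespace and applies no text normalization, so published benchmark figures, which normalize text first, may differ.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


print(wer("the cat sat on the mat", "the cat sat on mat"))
```

In the example, one of six reference words is missing from the hypothesis, so the WER is 1/6, or roughly 16.7%. Because the output of this model is lowercase and unpunctuated, references should be normalized the same way before scoring.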

Concrete use cases where this model fits:

Real-time meeting transcription on a local machine—no network calls, no API fees
Podcast and media transcription where accuracy and turnaround time matter
Voice-controlled applications that need on-device speech-to-text
Call center analytics running on-premises for compliance or quality monitoring
Academic research on speech recognition with a permissively licensed model

If you need punctuation, capitalization, or low-latency streaming (around 160ms), consider the Parakeet-unified-en-0.6b model instead, which offers those features at a smaller parameter count.

Running NVIDIA Parakeet TDT 1.1B Locally

Hardware Requirements

At FP16 precision, the model occupies approximately 2.2 GB of VRAM for the weights alone. Inference requires additional memory for activations and intermediate tensors. Realistic VRAM requirements:

Minimum (FP16, batch size 1): ~4 GB VRAM (e.g., RTX 3050, GTX 1660 Super, M1 Mac)
Recommended (FP16, batch size 1–2): 8 GB VRAM (RTX 3070, RTX 4060, M1 Pro/Max)
Comfortable (FP16, batch size 4–8): 12–16 GB VRAM (RTX 4090, A4500, M2 Max)
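
The weights-only figure follows from simple arithmetic: parameter count times bytes per parameter. The sketch below applies that to a few precisions. It uses decimal gigabytes and ignores activations, framework overhead, and any layers kept at higher precision, so real usage will be higher than these numbers.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 1.1e9  # parameter count listed for this model

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{label}: {weight_memory_gb(PARAMS, bits):.1f} GB")
```

The FP16 estimate is 2.2 GB, matching the figure quoted above, and the FP32 estimate of about 4.4 GB is close to the listed 4.3 GB download size.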

The model can also run on CPU with lower throughput. CPU-only or CPU-offloaded inference works for small batches of short audio but is not recommended for real-time use.

Quantization

Because the model uses a dense architecture (not MoE), quantization directly reduces memory and accelerates inference. Recommended approaches:

FP16: Default, best accuracy. Use for production where precision is critical.
INT8 (via TensorRT or NeMo quantization toolkit): Reduces VRAM to ~1.2 GB with minimal WER degradation. Suitable for memory-constrained GPUs (4–6 GB).
Q4_K_M or Q5_K_M (if converted to GGUF format): Reduces VRAM to ~0.7–0.9 GB. Useful for edge devices or GPUs with <4 GB.

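Note: The model is natively distributed in NeMo format, not GGUF, so treat the GGUF figures above as estimates. Producing quantized GGUF files means converting the checkpoint yourself, and llama.cpp’s conversion tools are built around LLM architectures, so they may not handle a FastConformer transducer without custom work. Most practitioners will use FP16 via NeMo and let the framework handle optimization.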

Expected Performance

On an RTX 4090 with FP16 and batch size 1, the model transcribes short audio clips at roughly 5–10x real-time (a 30-second audio clip processes in 3–6 seconds). Throughput scales with batch size: batch of 8 on the same GPU yields around 80–120 seconds of audio per second of wall time.

On M2 Max (96 GB unified memory), expect similar real-time factors. On RTX 3060 (12 GB), performance dips to 3–5x real-time for single clips.
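
These figures are best treated as rough guides, and it is easy to measure the speedup on your own hardware. The sketch below shows the arithmetic behind the "x real-time" numbers and a small timing helper; `transcribe_fn` is a placeholder for whatever call runs your transcription.

```python
import time


def processing_seconds(audio_seconds, speedup):
    """Wall-clock time needed at a given multiple of real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def measure_speedup(transcribe_fn, audio_seconds):
    """Time one call to transcribe_fn and return audio seconds per wall second."""
    start = time.perf_counter()
    transcribe_fn()
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")
    return audio_seconds / elapsed


print(processing_seconds(30, 5))   # 6.0
print(processing_seconds(30, 10))  # 3.0
```

For a stable measurement, run the transcription once as a warm-up before timing, since the first call usually includes model loading and kernel setup.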

Quick Start with NeMo

The model runs via NVIDIA NeMo. Installation and one-shot inference:

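pip install "nemo_toolkit[all]"

import nemo.collections.asr as nemo_asr

# Load the pretrained checkpoint (downloaded from Hugging Face on first use)
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="nvidia/parakeet-tdt-1.1b")
# Transcribe one or more audio files
transcription = model.transcribe(["audio_file.wav"])
# Recent NeMo releases return objects with a .text field; older releases may return plain strings
print(transcription[0].text)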

NeMo handles audio downsampling to 16kHz automatically. If you run into VRAM limits, reduce the batch size or convert to INT8.
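
Reducing the batch size can be automated. The sketch below retries with half the batch size whenever a run fails with an out-of-memory error. It assumes the model object exposes `transcribe(paths, batch_size=...)`, as NeMo ASR models do, and that out-of-memory failures surface as a `RuntimeError` whose message contains "out of memory". The helper name `transcribe_with_backoff` is illustrative.

```python
def transcribe_with_backoff(model, paths, batch_size=8):
    """Call model.transcribe, halving the batch size on out-of-memory errors."""
    while True:
        try:
            return model.transcribe(paths, batch_size=batch_size)
        except RuntimeError as err:
            out_of_memory = "out of memory" in str(err).lower()
            if not out_of_memory or batch_size == 1:
                raise  # unrelated error, or nothing left to reduce
            batch_size = max(1, batch_size // 2)
```

In a real PyTorch session you may also want to free cached GPU memory between attempts; that step is omitted here to keep the sketch free of dependencies.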

How It Compares

vs. Parakeet RNNT 1.1B

The closest comparison is the RNNT sibling. Both are 1.1B-parameter FastConformer models trained on the same dataset. The TDT version achieves equal or better WER while running 64% faster during decoding (NVIDIA’s published figure). If you are choosing between the two, go with TDT unless you have a specific reason to use the RNNT decoder (e.g., a custom pipeline that depends on RNNT internals). There is no accuracy tradeoff.

vs. Whisper large-v3 (1.55B)

Whisper large-v3 has 40% more parameters and supports multilingual transcription plus punctuation and casing out of the box. On English benchmarks, Parakeet TDT 1.1B matches or beats Whisper large-v3’s WER on LibriSpeech clean (1.39% vs. ~1.5%) but trails on noisy or accented speech. Whisper is a better choice if you need multilingual support, punctuation, or a model that works in a wider range of acoustic conditions. Parakeet TDT wins on speed and parameter efficiency—it runs faster on the same hardware and requires less VRAM.

When to Choose Parakeet TDT 1.1B

You need fast, real-time or super-real-time English transcription on a local GPU.
You don’t need punctuation or capitalization (or you have a separate post-processing step for that).
You want the permissive CC-BY-4.0 license for commercial deployment.
You’re already in the NeMo ecosystem and want minimal integration friction.

When to Look Elsewhere

You need punctuation, capitalization, or streaming with <200ms latency → Parakeet-unified-en-0.6b
You need multilingual ASR → Whisper large-v3 or Canary (NVIDIA)
You need speaker diarization built in → A separate diarization pipeline paired with any ASR model
