NVIDIA

NVIDIA Parakeet CTC 1.1B

NVIDIA Parakeet CTC 1.1B is an XXL FastConformer-CTC English ASR model jointly developed by NVIDIA NeMo and Suno.ai, offering strong non-autoregressive speech recognition accuracy with efficient inference.

1.1B params · Dense

Our Take

Best for: Open-source ASR workloads

A solid 1.1B-parameter dense audio model from NVIDIA. Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 1.1B
Architecture: Dense
Provider: NVIDIA
Download Size: 21.3 GB

Community

Monthly Downloads: 814.2K
Likes: 49
Last Updated: 8 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER: 7.4%

MBA Open Score: 67.4 BB

Benchmark (weight 40%): 85.2
Popularity (weight 25%): 65.3
Efficiency (weight 25%): 40.0
Versatility (weight 10%): 70.0
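The four component scores and their weights appear to combine into the overall score as a plain weighted sum. That formula is an assumption: the page does not state it, but it reproduces the displayed 67.4. A minimal sketch, with `composite_score` as a hypothetical helper name:

```python
# Component scores and weights copied from the breakdown above.
components = {
    "benchmark": (85.2, 0.40),
    "popularity": (65.3, 0.25),
    "efficiency": (40.0, 0.25),
    "versatility": (70.0, 0.10),
}


def composite_score(parts):
    """Weighted sum of (score, weight) pairs (assumed formula, not documented)."""
    total_weight = sum(weight for _, weight in parts.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1")
    return sum(score * weight for score, weight in parts.values())


score = composite_score(components)
print(f"{score:.1f}")  # prints 67.4
```

Under this reading, the low Efficiency score (40.0 at 25% weight) is what pulls the composite well below the 85.2 benchmark component.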

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.2 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.2 GB
AMD Instinct MI300X	AMD	SS	1.2 GB
AMD Instinct MI325X	AMD	SS	1.2 GB
AMD Instinct MI355X	AMD	SS	1.2 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.2 GB
AMD Radeon RX 7700 XT	AMD	SS	1.2 GB
AMD Radeon RX 7800 XT	AMD	SS	1.2 GB
AMD Radeon RX 7900 XT	AMD	SS	1.2 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.2 GB
AMD Radeon RX 9070	AMD	SS	1.2 GB
AMD Radeon RX 9070 XT	AMD	SS	1.2 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.2 GB
Apple M4	Apple	SS	1.2 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.2 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.2 GB
Apple M5	Apple	SS	1.2 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.2 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.2 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.2 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.2 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.2 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.2 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.2 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.2 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM)	$0.13
NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM)	$0.13
NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM)	$0.13
NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.13
NVIDIA GeForce RTX 4090 (Vast.ai · On-Demand · 24 GB VRAM)	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

About This Model

Overview

NVIDIA Parakeet CTC 1.1B is an English automatic speech recognition (ASR) model designed for high-accuracy, non-autoregressive transcription. Developed jointly by NVIDIA NeMo and Suno.ai, it uses a FastConformer-CTC architecture with 1.1 billion dense parameters — meaning every parameter is active during inference, no routing or sparsity tricks. This model targets practitioners who need reliable, low-latency speech-to-text on their own hardware, without relying on cloud APIs.

Parakeet CTC 1.1B sits at the top end of the Parakeet family, which also includes a 0.6B variant. It competes with other open-weight ASR models like OpenAI Whisper large-v3 (1.5B parameters) and Meta’s Wav2Vec2-XLSR-53. Where Whisper uses a transformer encoder-decoder with autoregressive decoding, Parakeet CTC uses a connectionist temporal classification (CTC) head on a FastConformer encoder — this makes inference significantly faster because it decodes in a single forward pass rather than token-by-token. The tradeoff is that CTC models typically require a language model for best accuracy, though Parakeet CTC 1.1B already delivers state-of-the-art results without an external LM.

Trained on a 64,000-hour dataset combining public and proprietary English speech (including LibriSpeech, Fisher, Switchboard, Common Voice, VoxPopuli, and more), this model handles diverse accents, noise conditions, and domains — from clean read speech to conversational meetings and financial earnings calls.

Architecture & Technical Details

Parakeet CTC 1.1B is built on the FastConformer architecture, an optimized variant of the Conformer model that pairs a convolutional subsampling frontend with a stack of Conformer blocks combining self-attention and depthwise convolutions. The “Fast” prefix refers to architectural changes that reduce compute without sacrificing accuracy, chiefly 8x depthwise-separable convolutional downsampling, which shortens the sequence the attention layers must process.

The model uses a CTC decoder, which outputs a probability distribution over subword (BPE) tokens plus a blank symbol for every encoder frame. During inference, greedy CTC decoding takes the most likely token per frame, collapses consecutive repeats, and removes blanks to produce the final transcript. This is inherently non-autoregressive: the encoder processes the whole clip in one forward pass, and the decoder produces all output logits in parallel. This makes Parakeet CTC 1.1B much faster than autoregressive models like Whisper, especially on longer audio clips.
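The collapse-and-remove-blanks step can be illustrated with a toy greedy decoder. This is a sketch of the general CTC rule, not NeMo's implementation; the blank index, the tiny vocabulary, and the function name `ctc_greedy_decode` are all assumptions made for the example.

```python
from itertools import groupby

BLANK = 0  # assumed blank index for this toy example


def ctc_greedy_decode(frame_ids, id_to_token, blank=BLANK):
    """Collapse consecutive repeated ids, then drop blanks.

    frame_ids stands in for the per-frame argmax over the CTC logits.
    """
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank)


# Hypothetical vocabulary; a real model would use its own tokenizer.
vocab = {1: "c", 2: "a", 3: "t", 4: "o"}

print(ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0, 3, 3], vocab))  # cat
print(ctc_greedy_decode([3, 4, 0, 4], vocab))  # too (blank separates the two o's)
print(ctc_greedy_decode([3, 4, 4], vocab))  # to (repeats without a blank merge)
```

The last two calls show why the blank symbol exists: it is the only way a CTC model can emit the same token twice in a row.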

Key specs:

Parameters: 1.1B (dense, all active)
Architecture: FastConformer encoder + CTC decoder
Modality: Audio → text (English only)
Context length: Not officially specified, but FastConformer can handle up to ~30 seconds of audio effectively (longer clips can be chunked)
License: CC-BY-4.0, which permits commercial use and modification provided attribution is given
Framework: NVIDIA NeMo (PyTorch), also available via Hugging Face Transformers
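The context-length entry notes that longer clips can be chunked. One way to do that is to cut the waveform into overlapping windows and transcribe each one separately. The sketch below only computes the window boundaries; the 16 kHz sample rate, the 2-second overlap, and the helper name `chunk_bounds` are assumptions, and stitching the overlapping transcripts back together is left out.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=2.0):
    """Return (start, end) sample indices of overlapping chunks covering a clip."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end >= num_samples:
            break
        start += step
    return bounds


# A 70-second clip at the assumed 16 kHz sample rate.
for start, end in chunk_bounds(70 * 16_000):
    print(f"{start / 16_000:.0f}s -> {end / 16_000:.0f}s")
```

The overlap gives each window some shared audio with its neighbour, so a word cut at one boundary appears whole in the adjacent chunk.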

The model was trained using mixed precision (FP16/BF16) and supports inference in FP16 or FP32. It does not require a separate language model, though one can be added for marginal WER improvements.

Capabilities & Use Cases

Parakeet CTC 1.1B excels at transcribing English speech with exceptional accuracy across a wide range of scenarios. The published Word Error Rates (WER) on standard benchmarks tell the story:

Dataset	WER
LibriSpeech clean	1.83%
LibriSpeech other	3.54%
GigaSpeech	10.27%
SPGI Speech	4.20%
TED-LIUM v3	3.54%
Earnings-22	13.69%
AMI (meetings)	15.62%

These numbers are competitive with or better than Whisper large-v3 on most benchmarks, particularly on clean read speech and academic datasets. The model handles meeting transcription, lectures, phone conversations, financial earnings calls, and general dictation with high reliability. It is robust to background noise, music, and silence — a result of the diverse 64k-hour training set.
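For readers who want to check WER on their own recordings, the metric is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain-Python illustration with no text normalization; published benchmark figures usually normalize casing and punctuation first, so results are not directly comparable. The function name `wer` is chosen for the example.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.3f}")  # 0.167
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.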

Concrete use cases:

Real-time captioning for video or live events (low latency due to CTC)
Meeting transcription for local tools (e.g., Otter.ai alternative)
Voice-controlled applications where latency matters
Medical dictation (with fine-tuning on domain data)
Podcast and media transcription for searchable archives
On-device speech-to-text for edge devices with a GPU

Because it’s English-only, it’s not suitable for multilingual transcription. If you need multilingual support, Whisper large-v3 is a better choice.

Running NVIDIA Parakeet CTC 1.1B Locally

This is where Parakeet CTC 1.1B shines — it’s designed for efficient local inference. The CTC decoder means you don’t need an autoregressive beam search, which cuts inference time dramatically.

Hardware Requirements

The weights alone take about 2.2 GB in FP16 (1.1B parameters × 2 bytes, roughly 2.05 GiB). In FP32, that doubles to ~4.4 GB. With typical inference overhead (activations, buffers), expect:

Quantization	VRAM (approx)	Recommended GPU
FP32	~4.5 GB	Any GPU with ≥6 GB VRAM
FP16	~2.5 GB	GTX 1060 6GB, RTX 2060, RTX 3060, M1/M2
INT8 (via TensorRT or NeMo)	~1.5 GB	RTX 30xx/40xx, M1/M2/Pro/Max
INT4 (via quantization)	~1.0 GB	RTX 4090, M4 Max (experimental)
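The weight-only part of these estimates is simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces it for each precision in the table. It covers weights only; activations, buffers, and framework overhead come on top and depend on clip length and batch size. The helper name `weight_memory_gb` is chosen for the example.

```python
PARAMS = 1.1e9  # parameter count from the model specifications

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


for precision, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(PARAMS, nbytes)
    gib = PARAMS * nbytes / 2**30  # the same amount in binary gibibytes
    print(f"{precision}: {gb:.2f} GB ({gib:.2f} GiB)")
```

FP16 comes out at 2.20 GB (about 2.05 GiB), so the ~2.5 GB table entry leaves roughly 0.3 GB for runtime overhead.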

Minimum: Any GPU with 4 GB VRAM can run FP16 with chunked audio (e.g., GTX 1050 Ti). Realistically, you want at least an RTX 3060 12GB or M1 Mac with 16GB unified memory for comfortable operation with full-length audio.

Recommended: An RTX 4090 or M4 Max (64GB unified) will run FP16 inference on 30-second clips in under 100ms. For production batch processing, a Tesla T4 (16GB) or RTX 4070 is sufficient.

Performance (Throughput)

Because CTC decoding is non-autoregressive, the bottleneck is the encoder forward pass. Throughput is given here as seconds of audio processed per second of wall time (a real-time multiple), not as tokens per second. On typical consumer hardware:

RTX 3090 / 4090: ~200-300x real time
RTX 3060 12GB: ~80-120x real time
M1 Max (32-core GPU): ~100-150x real time
M4 Max: ~180-250x real time

These numbers are for FP16 inference with a batch size of 1. With batch processing (multiple audio clips), throughput scales nearly linearly up to VRAM limits.
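The figures above are best checked on your own hardware. Read as seconds of audio processed per second of wall time, the metric can be measured with a small timing wrapper. In the sketch below, `transcribe` is a placeholder for whatever zero-argument callable runs your model on a clip, and the function names are chosen for the example.

```python
import time


def rtfx(audio_seconds, wall_seconds):
    """Seconds of audio processed per second of wall time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def measure_rtfx(transcribe, audio_seconds):
    """Time one call to `transcribe` (a placeholder zero-argument callable)."""
    start = time.perf_counter()
    transcribe()
    return rtfx(audio_seconds, time.perf_counter() - start)


# A 30-second clip transcribed in 0.15 s corresponds to about 200x real time.
print(f"{rtfx(30.0, 0.15):.0f}x")
```

When timing a GPU, run one warm-up call first and make sure the callable waits for the result, otherwise the measurement only covers kernel launch time.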

Quantization Recommendations

For most local users, FP16 offers the best balance of accuracy and speed. The model’s WER degradation at INT8 is minimal (<0.5% absolute), making INT8 a good choice if VRAM is tight. INT4 quantization is possible with tools like bitsandbytes or NVIDIA TensorRT, but expect a WER increase of 1-2% — acceptable for less critical applications.

Ollama does not yet support Parakeet CTC models natively (it focuses on LLMs). Instead, use NVIDIA NeMo or the Hugging Face Transformers pipeline. The quickest local setup:

pip install "nemo_toolkit[asr]"
python -c "from nemo.collections.asr.models import EncDecCTCModelBPE; model = EncDecCTCModelBPE.from_pretrained('nvidia/parakeet-ctc-1.1b')"

Or via Transformers:

from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-1.1b")

Hardware Compatibility Notes

Apple Silicon: Works via the PyTorch MPS backend. M1/M2/M3/M4 machines with unified memory are excellent; the model fits entirely in memory and runs fast.
Windows/Linux: GPU inference requires CUDA. RTX 20xx and newer are ideal. Older GTX cards (10xx series) work but are slower.
No GPU? CPU inference is possible but very slow (~0.5x real-time). Not recommended for interactive use.
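The platform notes above amount to a preference order: CUDA first, then Apple's MPS backend, then CPU as a last resort. A minimal sketch of that selection logic follows. `pick_device` and `detect_device` are hypothetical helper names; the two PyTorch availability checks are standard calls, and PyTorch is imported lazily so the pure function can be used without it.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, and fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; report CPU if PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(
        torch.cuda.is_available(),
        torch.backends.mps.is_available(),
    )


print(pick_device(cuda_available=False, mps_available=True))  # mps
```

The resulting string can be passed wherever PyTorch expects a device, for example when moving a loaded model with `.to(device)`.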

How It Compares

vs. OpenAI Whisper large-v3 (1.5B parameters)

Accuracy: Parakeet CTC 1.1B matches or beats Whisper on clean English benchmarks (e.g., LibriSpeech clean 1.83% vs Whisper’s ~2.0%). On noisy or accented speech, Whisper sometimes edges ahead due to its larger training set (1M+ hours) and multilingual support.
Speed: Parakeet CTC is 2-5x faster on the same hardware because of CTC vs autoregressive decoding. For real-time applications, Parakeet wins decisively.
Multilingual: Whisper supports 99 languages; Parakeet is English-only.
VRAM: Parakeet is smaller (1.1B vs 1.5B) and uses less memory, especially in FP16.
License: Both are permissive (CC-BY-4.0 for Parakeet, MIT for Whisper). CC-BY-4.0 requires attribution; MIT requires keeping the copyright and license notice.

When to choose Parakeet: You need low-latency English transcription, are constrained on VRAM, or want faster inference on consumer GPUs.

When to choose Whisper: You need multilingual support, or you need the absolute best accuracy on very noisy or accented speech (though the gap is small).

vs. Wav2Vec2-XLSR-53 (0.3B parameters)

Accuracy: Parakeet is dramatically better — Wav2Vec2-XLSR-53 achieves ~8% WER on LibriSpeech clean vs Parakeet’s 1.83%.
Speed: Wav2Vec2 is also CTC-based, so similar speed characteristics, but smaller model means faster inference.
Use case: Wav2Vec2 is better for fine-tuning on low-resource languages (it was pretrained on 53 languages). Parakeet is strictly English.

When to choose Parakeet: You need state-of-the-art English ASR without fine-tuning.

vs. Parakeet CTC 0.6B

The smaller sibling. Parakeet CTC 0.6B uses half the parameters, requires ~1.2 GB VRAM in FP16, and runs about 1.5x faster. WER is about 1-2% higher on most benchmarks. Choose the 0.6B if you’re on a low-end GPU or need maximum throughput; choose the 1.1B for maximum accuracy.
