
NVIDIA Parakeet CTC 1.1B is an XXL FastConformer-CTC English ASR model jointly developed by NVIDIA NeMo and Suno.ai, offering strong non-autoregressive speech recognition accuracy with efficient inference.
A solid 1.1B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 3070 · RunPod · Community · 8 GB VRAM | $0.13 |
| NVIDIA GeForce RTX 3070 · RunPod · Spot · 8 GB VRAM | $0.13 |
| NVIDIA GeForce RTX 4090 · Vast.ai · Spot · 24 GB VRAM | $0.13 |
| NVIDIA GeForce RTX 5070 Ti · Vast.ai · Spot · 16 GB VRAM | $0.13 |
| NVIDIA GeForce RTX 4090 · Vast.ai · On-Demand · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
NVIDIA Parakeet CTC 1.1B is an English automatic speech recognition (ASR) model designed for high-accuracy, non-autoregressive transcription. Developed jointly by NVIDIA NeMo and Suno.ai, it uses a FastConformer-CTC architecture with 1.1 billion dense parameters — meaning every parameter is active during inference, no routing or sparsity tricks. This model targets practitioners who need reliable, low-latency speech-to-text on their own hardware, without relying on cloud APIs.
Parakeet CTC 1.1B sits at the top end of the Parakeet family, which also includes a 0.6B variant. It competes with other open-weight ASR models like OpenAI Whisper large-v3 (1.5B parameters) and Meta’s Wav2Vec2-XLSR-53. Where Whisper uses a transformer encoder-decoder with autoregressive decoding, Parakeet CTC uses a connectionist temporal classification (CTC) head on a FastConformer encoder — this makes inference significantly faster because it decodes in a single forward pass rather than token-by-token. The tradeoff is that CTC models typically require a language model for best accuracy, though Parakeet CTC 1.1B already delivers state-of-the-art results without an external LM.
Trained on a 64,000-hour dataset combining public and proprietary English speech (including LibriSpeech, Fisher, Switchboard, Common Voice, VoxPopuli, and more), this model handles diverse accents, noise conditions, and domains — from clean read speech to conversational meetings and financial earnings calls.
Parakeet CTC 1.1B is built on the FastConformer architecture, an optimized variant of the Conformer model that uses a 2D-convolutional subsampling frontend and a stack of conformer blocks with self-attention and depthwise convolutions. The “Fast” prefix refers to architectural changes that reduce computational overhead without sacrificing accuracy — specifically, using grouped convolutions and a more efficient attention mechanism.
The model uses a CTC decoder, which outputs a probability distribution over the output vocabulary (plus a blank symbol) for each encoder frame. The vocabulary consists of sub-word (BPE) tokens rather than single characters, as the EncDecCTCModelBPE class used to load the model in NeMo indicates. During inference, the CTC algorithm collapses repeated tokens and removes blanks to produce the final transcript. This is inherently non-autoregressive: the entire audio is processed in one shot, and the decoder produces all output logits in parallel. This makes Parakeet CTC 1.1B much faster than autoregressive models like Whisper, especially on longer audio clips.
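The collapse step can be illustrated with a minimal sketch. It assumes the per-frame argmax has already been taken, and it uses single characters as tokens and `_` as the blank symbol purely for readability; both are illustrative choices, not the model's actual vocabulary.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol chosen for this illustration


def ctc_greedy_collapse(frame_tokens, blank=BLANK):
    """Greedy CTC post-processing: merge consecutive repeats, then drop blanks."""
    # groupby yields one key per run of identical consecutive tokens
    merged = [token for token, _ in groupby(frame_tokens)]
    return [token for token in merged if token != blank]


# One token per encoder frame. The blank between the two "l" runs is what
# allows a genuine double letter to survive the merge step.
frames = ["_", "h", "h", "_", "e", "l", "l", "_", "l", "o", "o", "_"]
print("".join(ctc_greedy_collapse(frames)))
```

Every frame is handled independently of the text emitted so far, which is why the whole sequence can be decoded in parallel.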
Key specs:
The model was trained using mixed precision (FP16/BF16) and supports inference in FP16 or FP32. It does not require a separate language model, though one can be added for marginal WER improvements.
Parakeet CTC 1.1B excels at transcribing English speech with exceptional accuracy across a wide range of scenarios. The published Word Error Rates (WER) on standard benchmarks tell the story:
| Dataset | WER |
|---|---|
| LibriSpeech clean | 1.83% |
| LibriSpeech other | 3.54% |
| GigaSpeech | 10.27% |
| SPGI Speech | 4.20% |
| TED-LIUM v3 | 3.54% |
| Earnings-22 | 13.69% |
| AMI (meetings) | 15.62% |
These numbers are competitive with or better than Whisper large-v3 on most benchmarks, particularly on clean read speech and academic datasets. The model handles meeting transcription, lectures, phone conversations, financial earnings calls, and general dictation with high reliability. It is robust to background noise, music, and silence — a result of the diverse 64k-hour training set.
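For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain implementation of that definition; it applies no text normalization, and published benchmark figures usually do, so it will not reproduce the table exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")
```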
Concrete use cases:
Because it’s English-only, it’s not suitable for multilingual transcription. If you need multilingual support, Whisper large-v3 is a better choice.
This is where Parakeet CTC 1.1B shines — it’s designed for efficient local inference. The CTC decoder means you don’t need an autoregressive beam search, which cuts inference time dramatically.
The model's weights occupy about 2.2 GB of VRAM in FP16 (1.1B parameters × 2 bytes). In FP32, that doubles to ~4.4 GB. With typical inference overhead (activations, buffers), expect:
| Quantization | VRAM (approx) | Recommended GPU |
|---|---|---|
| FP32 | ~4.5 GB | Any GPU with ≥6 GB VRAM |
| FP16 | ~2.5 GB | GTX 1060 6GB, RTX 2060, RTX 3060, M1/M2 |
| INT8 (via TensorRT or NeMo) | ~1.5 GB | RTX 30xx/40xx, M1/M2/Pro/Max |
| INT4 (via quantization) | ~1.0 GB | RTX 4090, M4 Max (experimental) |
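The weight-only part of these estimates is simple arithmetic: parameter count times bytes per parameter. The sketch below reproduces it, assuming decimal gigabytes and the 1.1B parameter count stated above. It deliberately excludes activations and buffers, which account for the difference from the table's approximate figures.

```python
PARAMS = 1.1e9  # parameter count stated in the text

# Storage cost per parameter for each precision, in bytes
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_gb(params: float, precision: str) -> float:
    """Weight storage only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_gb(PARAMS, precision):.2f} GB of weights")
```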
Minimum: Any GPU with 4 GB VRAM can run FP16 with chunked audio (e.g., GTX 1050 Ti). Realistically, you want at least an RTX 3060 12GB or M1 Mac with 16GB unified memory for comfortable operation with full-length audio.
Recommended: An RTX 4090 or M4 Max (64GB unified) will run FP16 inference on 30-second clips in under 100ms. For production batch processing, a Tesla T4 (16GB) or RTX 4070 is sufficient.
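The chunked-audio approach mentioned for low-VRAM cards amounts to splitting a long recording into fixed-length, slightly overlapping windows and transcribing each one separately. A minimal sketch of the splitting step follows. The 16 kHz sample rate, 30-second window, and 2-second overlap are assumptions chosen for illustration, and stitching the overlapping transcripts back together is not shown.

```python
def chunk_bounds(num_samples, sample_rate=16_000, chunk_s=30.0, overlap_s=2.0):
    """Return (start, end) sample indices of overlapping chunks covering the audio."""
    chunk = int(chunk_s * sample_rate)
    step = chunk - int(overlap_s * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the final chunk reached the end of the audio
        start += step
    return bounds


# A 70-second recording at the assumed 16 kHz sample rate
for start, end in chunk_bounds(70 * 16_000):
    print(f"{start / 16_000:.0f}s to {end / 16_000:.0f}s")
```

Peak activation memory then depends on the window length rather than the full recording length, at the cost of some redundant computation in the overlapping regions.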
Because CTC decoding is non-autoregressive, the bottleneck is the encoder forward pass. On typical consumer hardware:
These numbers are for FP16 inference with a batch size of 1. With batch processing (multiple audio clips), throughput scales nearly linearly up to VRAM limits.
For most local users, FP16 offers the best balance of accuracy and speed. The model’s WER degradation at INT8 is minimal (<0.5% absolute), making INT8 a good choice if VRAM is tight. INT4 quantization is possible with tools like bitsandbytes or NVIDIA TensorRT, but expect a WER increase of 1-2% — acceptable for less critical applications.
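To give a feel for why INT8 tends to cost little accuracy, here is a toy sketch of symmetric per-tensor quantization on a handful of made-up weights. It is an illustration of the general idea only, not the procedure TensorRT, NeMo, or bitsandbytes actually apply, and the weight values are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.003, 0.9]  # hypothetical values for illustration
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is bounded by half a quantization step
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_err)
```

With only 4 bits the step size grows roughly sixteenfold for the same value range, which is consistent with the larger accuracy loss described for INT4.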
Ollama does not yet support Parakeet CTC models natively (it focuses on LLMs). Instead, use NVIDIA NeMo or the Hugging Face Transformers pipeline. The quickest local setup:
```shell
pip install nemo_toolkit[asr]
python -c "from nemo.collections.asr.models import EncDecCTCModelBPE; model = EncDecCTCModelBPE.from_pretrained('nvidia/parakeet-ctc-1.1b')"
```
Or via Transformers:
```python
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="nvidia/parakeet-ctc-1.1b")
```
When to choose Parakeet: You need low-latency English transcription, are constrained on VRAM, or want faster inference on consumer GPUs.
When to choose Whisper: You need multilingual support, or you need the absolute best accuracy on very noisy or accented speech (though the gap is small).
When to choose Parakeet: You need state-of-the-art English ASR without fine-tuning.
The smaller sibling. Parakeet CTC 0.6B uses half the parameters, requires ~1.2 GB VRAM in FP16, and runs about 1.5x faster. WER is about 1-2% higher on most benchmarks. Choose the 0.6B if you’re on a low-end GPU or need maximum throughput; choose the 1.1B for maximum accuracy.
