
NVIDIA Parakeet TDT 0.6B v3 is a 600M-parameter multilingual ASR model supporting 25 European languages with automatic language detection, offering the highest throughput among multilingual models on the Hugging Face Open ASR leaderboard.
A strong 0.6B-parameter dense audio model from NVIDIA. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
NVIDIA Parakeet TDT 0.6B v3 is a 600‑million‑parameter automatic speech recognition (ASR) model designed for multilingual transcription. Developed by NVIDIA and released under the permissive CC‑BY‑4.0 license, it supports 25 European languages with automatic language detection — a practical feature for environments where the spoken language isn’t known in advance.
This is a dense, text‑output model that takes raw audio as input and produces transcribed text. Despite the small parameter count, it holds the top spot on the Hugging Face Open ASR leaderboard for throughput among multilingual models. That means it’s fast enough for real‑time or near‑real‑time transcription on consumer hardware, not just data‑center GPUs.
Parakeet TDT 0.6B v3 competes with models like OpenAI Whisper medium (0.5B) and Distil‑Whisper (0.8B). Its edge is latency: the Transducer‑decoder architecture (TDT) enables streaming inference with a constant memory footprint, unlike encoder‑decoder models that must process the entire utterance before output begins. For on‑premises deployments where end‑to‑end latency matters — live captions, voice assistants, meeting transcription — this is a meaningful advantage.
Parakeet TDT 0.6B v3 is built on the FastConformer encoder with a Transducer decoder (TDT). The encoder ingests 80‑channel log‑Mel filterbank features and outputs a frame‑level representation; the Transducer then jointly models alignment and language prediction, emitting text tokens as audio progresses.
The model is available in two library formats: nemo (the native NVIDIA NeMo toolkit) and transformers (Hugging Face integration). For local deployment, the transformers variant is the more accessible entry point because it works with standard inference pipelines and can be quantized using tools like optimum or llama.cpp (via conversion to GGUF).
Parakeet TDT 0.6B v3 is a speech‑to‑text model only — it does not generate language, translate, or perform any other NLP task. Its strengths are in accurate, low‑latency transcription across a broad European language set.
| Benchmark (English) | Word Error Rate |
|---|---|
| LibriSpeech (clean) | 1.93% |
| SPGI Speech | 3.97% |
| GigaSpeech | 9.59% |
| AMI Meetings | 11.31% |
| Earnings‑22 | 11.42% |
| Tedlium v3 | 2.75% |
| Vox Populi | 6.14% |
These WERs are competitive with models 2–3× larger. The model handles accented speech, meeting‑style multi‑speaker audio, and financial earnings calls without special fine‑tuning.
Concrete use cases:
Because the model outputs text only (no punctuation or capitalization natively), a post‑processing step may be needed for polished transcripts. The newer “unified” variant of Parakeet adds punctuation, but this v3 model requires an external punctuation restoration model if that’s a requirement.
This model is extremely lightweight for a dense ASR system. Here’s what you need.
| Quantization | Model Weights | Estimated VRAM (inference) |
|---|---|---|
| FP16 | ~1.2 GB | 2.0–2.5 GB |
| Q8_0 | ~0.6 GB | 1.2–1.6 GB |
| Q4_K_M | ~0.35 GB | 0.8–1.2 GB |
Best quantization for NVIDIA Parakeet TDT 0.6B v3: For most users, Q4_K_M strikes the best balance between accuracy and memory. WER degrades by less than 0.5 percentage points on clean English speech while freeing up VRAM for other processes. If you need maximum accuracy on challenging audio (accents, background noise), use Q8_0.
Q4_K_M comfortably. The best GPU for NVIDIA Parakeet TDT 0.6B v3 is an RTX 4090 if you want to batch multiple streams — you can fit 8–10 simultaneous real‑time transcriptions at Q4_K_M. NVIDIA claims the highest throughput on the Open ASR leaderboard. In local tests (RTX 4090, FP16, batch size 1), the model processes at an RTF of ~0.04 — that’s 25× real‑time speed. Even on a laptop RTX 4060, RTF stays under 0.1.
If you need tokens‑per‑second numbers for text generation pipelines, note that this is an ASR model: it emits tokens at the rate of audio. A typical English utterance yields ~150 characters per second of audio, which translates to roughly 40‑50 text tokens per second. The inference engine itself can produce tokens much faster, but the audio input is the bottleneck.
Ollama doesn’t natively support ASR models yet, but you can run Parakeet TDT 0.6B v3 via the Hugging Face transformers pipeline with only a few lines of Python. For a script‑based setup:
1from transformers import pipeline23asr = pipeline("automatic-speech-recognition", model="nvidia/parakeet-tdt-0.6b-v3")4result = asr("path/to/audio.wav")5print(result["text"])
For streaming, use the NeMo toolkit directly (requires installing nemo from GitHub). A GGUF conversion workflow is also possible via llama.cpp for CPU‑optimized inference, though it’s not officially supported by NVIDIA.
| Aspect | Parakeet TDT 0.6B v3 | Whisper medium (0.5B) |
|---|---|---|
| Architecture | FastConformer + Transducer | Encoder‑Decoder (Transformer) |
| Streaming | Yes (native) | No (full utterance required) |
| Languages | 25 European | 99+ languages |
| WER (LibriSpeech) | 1.93% | ~4% (official) |
| Latency (RTF) | < 0.1 on consumer GPU | 0.15–0.2 on similar hardware |
| License | CC‑BY‑4.0 | MIT |
Choose Parakeet if you need streaming, lower latency, and work primarily with European languages. Choose Whisper medium for broader language support or if you don’t need real‑time output.
Distil‑Whisper is a distilled version of Whisper large‑v2 with a 0.8B parameter count. It’s faster than Whisper medium but still not streaming. Parakeet TDT 0.6B v3 is smaller, faster at inference, and supports streaming natively. Distil‑Whisper has better English-only WER on some benchmarks, but Parakeet matches or exceeds it on polyglot scenarios thanks to its 25‑language training.
For local deployments: If you need to run NVIDIA Parakeet TDT 0.6B v3 locally on a consumer GPU for real‑time multilingual transcription, this is the best fit. If your workload is English-only batch transcription and you already have a Whisper pipeline, Distil‑Whisper is a simpler drop‑in replacement.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every NVIDIA model we track.