
A knowledge-distilled, English-only version of OpenAI's Whisper Large v3, released by Hugging Face. Trained on 98k hours with a 'patient' teacher and SpecAugment, it runs ~1.5× faster than Whisper Large v3 Turbo at comparable word error rates.
A solid 0.8B-parameter dense audio model from Hugging Face. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 3080 (Vast.ai · Spot · 10 GB VRAM) | $0.03 |
| NVIDIA GeForce RTX 3080 (Vast.ai · On-Demand · 10 GB VRAM) | $0.03 |
| NVIDIA GeForce RTX 5060 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.09 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.09 |
| NVIDIA GeForce RTX 5060 Ti (Vast.ai · On-Demand · 16 GB VRAM) | $0.10 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Distil-Whisper Large v3.5 is an English-only automatic speech recognition (ASR) model developed by Hugging Face’s Distil-Whisper team. It is a knowledge-distilled version of OpenAI’s Whisper Large v3, optimized for local inference on consumer hardware. At 756M parameters (listed as 0.8B after rounding), it balances accuracy and speed: it runs approximately 1.5× faster than the official Whisper Large v3 Turbo (809M parameters) while maintaining comparable word error rates (WER).
What makes this model stand out is its training recipe. Unlike earlier Distil-Whisper variants, v3.5 was trained on over 98,000 hours of diverse public data (more than 4× the data used for v3), using a “patient” teacher strategy and SpecAugment for aggressive data augmentation. This produces a model that is more robust to noise, accents, and varied recording conditions than previous distilled checkpoints. It is released under the MIT license, making it suitable for commercial and research applications.
For practitioners who want to run English ASR locally without trading much accuracy for speed, Distil-Whisper Large v3.5 is a drop-in replacement for Whisper Large v3 and its Turbo variant: it delivers comparable accuracy on short-form transcription and runs significantly faster, at the cost of a somewhat higher WER than Turbo on long-form audio.
Distil-Whisper Large v3.5 uses a standard dense encoder-decoder transformer architecture, the same as the original Whisper. It has 756 million parameters (0.8B), all active during inference — no Mixture-of-Experts or sparse activation. This means VRAM usage is proportional to the full parameter count.
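Because every parameter is active, a rough lower bound on weight memory follows directly from the parameter count. The sketch below is a back-of-the-envelope estimate only: the byte widths per precision are nominal assumptions, and the figures exclude activations, the decoder cache, and framework overhead, so real VRAM usage will be higher.

```python
# Rough weight-memory estimate for a dense model: parameters x bytes per parameter.
# The byte widths below are nominal assumptions; quantized formats add some
# per-block metadata that is ignored here.

PARAMS = 756_000_000  # parameter count stated for Distil-Whisper Large v3.5

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "4-bit (nominal)": 0.5,
}


def weight_gb(params: int, bytes_per_param: float) -> float:
    """Return the size of the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, width in BYTES_PER_PARAM.items():
        print(f"{name:>16}: {weight_gb(PARAMS, width):.2f} GB (weights only)")
```

Under these assumptions the FP16 weights come to about 1.5 GB, which is why the model fits comfortably on GPUs with modest VRAM.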
The model retains the 30-second chunk processing design of Whisper: it transcribes audio in fixed 30-second segments, making it well-suited for both short clips and long-form audio (via sliding window). The context length is effectively the 30-second window; the model does not have an extended context beyond that, unlike text-based LLMs.
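One simple way to picture the sliding window is to split a long recording into overlapping 30-second spans. The sketch below is illustrative only: the helper name `chunk_bounds` and the 5-second overlap are assumptions made for this example, not values taken from the model's documentation; the 16 kHz sample rate is the rate Whisper models expect.

```python
# Split a long recording into overlapping fixed-length windows.
# The overlap value is an illustrative assumption, not a documented default.
from typing import List, Tuple

SAMPLE_RATE = 16_000  # Whisper-family models consume 16 kHz audio


def chunk_bounds(
    duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    if duration_s <= 0:
        return []
    step = chunk_s - overlap_s
    bounds = []
    start = 0.0
    while True:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:  # the last window reached the end of the audio
            break
        start += step
    return bounds


def to_samples(seconds: float) -> int:
    """Convert a time in seconds to a sample index at 16 kHz."""
    return int(round(seconds * SAMPLE_RATE))
```

For a 70-second file this yields three windows starting at 0, 25, and 50 seconds; the overlapping regions give a stitching step room to reconcile words cut at a boundary.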
Key architectural details from training:
Because the encoder is frozen during distillation, Distil-Whisper Large v3.5 can also be used as a draft model for speculative decoding with the full Whisper Large v3. This allows you to load only two extra decoder layers and share the encoder, achieving ~2× faster inference than Whisper Large v3 while producing identical outputs. This technique is especially useful for latency-sensitive applications where you need the accuracy of the full model but the speed of the distilled one.
This model is a speech-to-text ASR system: it takes audio waveforms as input and outputs transcribed text. It is English-only and does not perform speech translation.
What it excels at:
Where it falls short:
Concrete use cases:
This model is designed to run on consumer hardware with moderate VRAM. Here’s what you need:
VRAM requirements (approx.)
Recommended quantization for most users: Q4_K_M
This preserves near‑lossless accuracy while substantially reducing memory relative to FP16. Benchmarks from the Distil-Whisper team show <0.5% WER degradation at Q4_K_M relative to FP16. For maximum accuracy with moderate VRAM, use FP16.
Consumer hardware that works:
Expected performance (tokens per second)
Quickest way to get started:
For most users, Hugging Face Transformers is the most direct route: load the checkpoint with from_pretrained('distil-whisper/distil-large-v3.5'). [Ollama](https://ollama.com) is sometimes suggested as an alternative, but confirm that a distil-whisper tag and audio input are available in your installed version before relying on ollama run distil-whisper.
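A minimal sketch of the Transformers route is shown below, assuming `torch` and `transformers` are installed. The helper name `build_asr_pipeline` and the file name in the usage comment are placeholders introduced for this example; the heavy imports sit inside the function so that nothing is downloaded until it is called.

```python
# Minimal sketch: build a Transformers ASR pipeline for the distilled checkpoint.
# Assumes `torch` and `transformers` are installed; weights download on first call.

MODEL_ID = "distil-whisper/distil-large-v3.5"


def build_asr_pipeline(model_id: str = MODEL_ID):
    """Return an automatic-speech-recognition pipeline on GPU if one is available."""
    import torch
    from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

    use_cuda = torch.cuda.is_available()
    device = "cuda:0" if use_cuda else "cpu"
    dtype = torch.float16 if use_cuda else torch.float32  # FP16 only on GPU

    model = AutoModelForSpeechSeq2Seq.from_pretrained(
        model_id, torch_dtype=dtype, use_safetensors=True
    )
    model.to(device)
    processor = AutoProcessor.from_pretrained(model_id)

    return pipeline(
        "automatic-speech-recognition",
        model=model,
        tokenizer=processor.tokenizer,
        feature_extractor=processor.feature_extractor,
        torch_dtype=dtype,
        device=device,
    )


# Example usage ("sample.wav" is a placeholder path):
# asr = build_asr_pipeline()
# print(asr("sample.wav")["text"])
```

Falling back to FP32 on CPU is a conservative choice for this sketch, since half precision is mainly a GPU optimization.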
Hardware tip: If you plan to use speculative decoding with Whisper Large v3, you need enough VRAM to hold both the full model and the draft decoder (about 1.5 GB extra). An RTX 3090/4090 is ideal.
| Model | Params | Relative RTFx | Short‑Form OOD WER | Long‑Form OOD WER | Notes |
|---|---|---|---|---|---|
| distil-large-v3.5 | 756M | 1.46 | 7.08 | 11.39 | Recommended for balance of speed and accuracy |
| large-v3-turbo | 809M | 1.0 | 7.30 | 10.25 | Best long‑form accuracy; speed baseline for RTFx |
| distil-large-v3 | 756M | 1.44 | 7.53 | 11.6 | Older variant; lower accuracy than v3.5 |
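For readers unfamiliar with the WER columns, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words, usually reported as a percentage. The sketch below is a plain implementation for illustration; the function name `wer` is chosen here, and published benchmarks typically normalize text (casing, punctuation) before scoring, which this example does not do.

```python
# Word error rate: (substitutions + deletions + insertions) / reference words.
# Computed with a standard dynamic-programming edit distance over words.


def wer(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction (multiply by 100 for a percentage)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    score = wer("the cat sat on the mat", "the cat sit on mat")
    print(f"WER: {100 * score:.2f}%")
```

In the example, one substituted word and one dropped word against a six-word reference give a WER of 2/6, or about 33%.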
When to choose Distil-Whisper Large v3.5:
When to choose Whisper Large v3 Turbo:
When to choose an even smaller model:
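Distil-Whisper Large v3.5 can also serve as the draft model for speculative decoding: if you need the full model’s accuracy and can accept variable latency, pair it with Whisper Large v3 to get the same output as the full model at roughly 2× the speed. Earlier Distil-Whisper checkpoints can be paired with their teacher models in the same way, so the case for v3.5 here rests on its lower standalone WER rather than on exclusivity.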
