
An encoder-only, CTC-based open speech foundation model from ESPnet/CMU that reproduces Whisper-style multilingual ASR, speech translation and language identification using fully public data. Trained on 320k hours of cleaned YODAS + prior OWSM data across 75 languages.
A 1B-parameter dense audio model from ESPnet. Treat the modality benchmarks above as the leading indicator of fit, since composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.10 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | On-Demand | 16 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
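One rough way to account for restarts is to divide the spot rate by the fraction of paid time that produces useful work. The sketch below is only an illustration: the function names and the `lost_fraction` parameter are hypothetical, and the 10% loss figure is an assumed value, not a measurement from either provider. The two rates are the RTX 5070 Ti spot and on-demand prices listed above.

```python
def effective_rate(spot_rate: float, lost_fraction: float) -> float:
    """Cost per hour of useful work when part of the paid time is lost to restarts."""
    if not 0.0 <= lost_fraction < 1.0:
        raise ValueError("lost_fraction must be in [0, 1)")
    return spot_rate / (1.0 - lost_fraction)


def break_even_lost_fraction(spot_rate: float, on_demand_rate: float) -> float:
    """Lost fraction at which spot and on-demand cost the same per useful hour."""
    return 1.0 - spot_rate / on_demand_rate


# RTX 5070 Ti on Vast.ai: $0.10 spot vs $0.13 on-demand (from the table).
spot, on_demand = 0.10, 0.13

# Assumed: 10% of paid spot time is lost to interruptions and restarts.
print(round(effective_rate(spot, 0.10), 4))                 # 0.1111
print(round(break_even_lost_fraction(spot, on_demand), 3))  # 0.231
```

Under these assumptions, the spot instance stays cheaper per useful hour until roughly 23% of paid time is lost to interruptions.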
OWSM (Open Whisper-style Speech Model) is a community effort led by CMU's WAVLab and the ESPnet team to build fully reproducible, openly trained alternatives to OpenAI Whisper. The v4 CTC variant is an encoder-only model using hierarchical multi-task self-conditioned CTC with an E-Branchformer encoder.
Well suited to reproducible research, multilingual transcription and translation across 75 languages, forced alignment (CTC segmentation), and use as a base for further fine-tuning where training transparency is required.
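Because the model is encoder-only and CTC-based, it emits one score vector per audio frame, and the simplest way to read a transcript out of those frames is greedy CTC decoding: take the best token per frame, merge consecutive repeats, then drop blanks. The sketch below shows that generic procedure on a toy three-symbol vocabulary. It is not ESPnet's implementation, and the function name, the `blank_id=0` convention, and the scores are assumptions made for illustration.

```python
from typing import List, Sequence


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]], blank_id: int = 0
) -> List[int]:
    """Greedy CTC decoding: per-frame argmax, collapse repeats, remove blanks."""
    # Best-scoring token index for every frame.
    best_path = [max(range(len(f)), key=lambda i: f[i]) for f in frame_scores]

    decoded: List[int] = []
    previous = None
    for token in best_path:
        # Keep a token only when it differs from the previous frame and is not blank.
        if token != previous and token != blank_id:
            decoded.append(token)
        previous = token
    return decoded


# Toy vocabulary (hypothetical): index 0 is the CTC blank.
vocab = ["<blank>", "a", "b"]

# Seven frames whose argmax path is: a a <blank> a b b <blank>
scores = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
    [0.8, 0.1, 0.1],
]

ids = ctc_greedy_decode(scores)
print("".join(vocab[i] for i in ids))  # aab
```

The blank between the two runs of `a` is what lets CTC output a doubled character. Without it, the repeated frames would collapse into a single `a`.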