
An encoder-only, CTC-based open speech foundation model from ESPnet/CMU that reproduces Whisper-style multilingual ASR, speech translation, and language identification using fully public data. It was trained on 320k hours of cleaned YODAS plus prior OWSM data, covering 75 languages.
A solid 1B-parameter dense audio model from ESPnet. Treat the per-modality benchmarks as the leading indicator of fit; composite scoring across modalities is still maturing.
OWSM (Open Whisper-style Speech Model) is a community effort led by CMU's WAVLab and the ESPnet team to build fully reproducible, openly trained alternatives to OpenAI Whisper. The v4 CTC variant is an encoder-only model using hierarchical multi-task self-conditioned CTC with an E-Branchformer encoder.
Best suited for reproducible research, multilingual transcription and translation across 75 languages, forced alignment via CTC segmentation, and as a base for further fine-tuning where training transparency is required.
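Forced alignment with CTC works because the frame-level posteriors carry timing: each emitted token can be traced back to the frames that produced it. The real CTC-segmentation algorithm used with ESPnet runs a dynamic-programming alignment over the full posterior matrix; the core idea can be sketched with a simplified greedy version (toy values, assumed frame rate):

```python
# Simplified view of CTC-based alignment: map each emitted (non-blank,
# non-repeated) token back to the span of frames that produced it.
# This is a toy sketch; real CTC segmentation does a proper
# dynamic-programming alignment over the posterior matrix.

BLANK = 0

def rough_ctc_alignment(frame_ids, frame_dur_s=0.04, blank=BLANK):
    """Return (token_id, start_sec, end_sec) spans from a greedy path."""
    spans = []
    prev = None
    for i, t in enumerate(frame_ids):
        if t != blank and t != prev:
            spans.append([t, i, i + 1])  # a new token starts at this frame
        elif t != blank and spans and spans[-1][0] == t:
            spans[-1][2] = i + 1         # repeated label extends the token
        prev = t
    return [(tok, s * frame_dur_s, e * frame_dur_s) for tok, s, e in spans]

# Toy path: token 3 spans frames 1-2, token 5 spans frames 4-6.
path = [0, 3, 3, 0, 5, 5, 5, 0]
for tok, start, end in rough_ctc_alignment(path):
    print(tok, round(start, 2), round(end, 2))
```

The 0.04 s frame duration here is only an assumed placeholder; the true time base depends on the model's feature hop and encoder subsampling.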