An encoder-only, CTC-based open speech foundation model from ESPnet/CMU that reproduces Whisper-style multilingual ASR, speech translation, and language identification using fully public data. It was trained on 320k hours of cleaned YODAS data plus prior OWSM data, covering 75 languages.
OWSM (Open Whisper-style Speech Model) is a community effort led by CMU's WAVLab and the ESPnet team to build fully reproducible, openly trained alternatives to OpenAI Whisper. The v4 CTC variant is an encoder-only model using hierarchical multi-task self-conditioned CTC with an E-Branchformer encoder.
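Because the model is CTC-based rather than attention-decoder-based, its greedy decoding reduces to collapsing a frame-level path: drop repeated tokens, then drop blanks. A minimal sketch of that collapse rule (the token IDs and blank ID below are illustrative, not the actual OWSM vocabulary):

```python
# Hypothetical blank ID for illustration; real vocabularies define their own.
BLANK = 0

def ctc_greedy_collapse(frame_ids):
    """Collapse a frame-level CTC path into an output token sequence:
    merge consecutive repeats, then remove blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != BLANK:
            out.append(t)
        prev = t
    return out

# A 10-frame path with repeats and blanks collapses to 5 output tokens.
path = [8, 8, 0, 5, 0, 12, 12, 0, 12, 15]
print(ctc_greedy_collapse(path))  # [8, 5, 12, 12, 15]
```

Note that a blank between two identical labels (frames 7-9 above) is what lets CTC emit the same token twice in a row.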
Reproducible research, multilingual transcription and translation across 75 languages, forced alignment via CTC segmentation, and use as a base for further fine-tuning where training transparency is required.
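The forced-alignment use case rests on CTC segmentation, which Viterbi-aligns a known transcript to the encoder's frame-level posteriors over a blank-interleaved label sequence. A toy sketch of that core alignment step, assuming a tiny made-up vocabulary and hand-set probabilities (this is not the ESPnet implementation):

```python
import math

BLANK = 0  # hypothetical blank ID for illustration

def ctc_forced_align(log_probs, labels):
    """Viterbi-align `labels` to frames given per-frame log-probabilities.
    Returns the label (or blank) emitted at each frame on the best path."""
    # Blank-interleaved extension: blank, l1, blank, l2, ..., blank
    ext = [BLANK]
    for l in labels:
        ext.extend([l, BLANK])
    T, S = len(log_probs), len(ext)
    dp = [[-math.inf] * S for _ in range(T)]
    bp = [[0] * S for _ in range(T)]
    dp[0][0] = log_probs[0][ext[0]]
    dp[0][1] = log_probs[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            # Stay, advance by one, or skip a blank between distinct labels.
            cands = [(dp[t - 1][s], s)]
            if s >= 1:
                cands.append((dp[t - 1][s - 1], s - 1))
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                cands.append((dp[t - 1][s - 2], s - 2))
            best, arg = max(cands)
            dp[t][s] = best + log_probs[t][ext[s]]
            bp[t][s] = arg
    # Backtrack from the better of the two valid final states.
    s = S - 1 if dp[T - 1][S - 1] >= dp[T - 1][S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t][s]
        path.append(s)
    path.reverse()
    return [ext[i] for i in path]

# Toy posteriors over {0: blank, 1: 'a', 2: 'b'} for 4 frames.
lp = [[math.log(p) for p in row] for row in
      [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.1, 0.1, 0.8]]]
print(ctc_forced_align(lp, [1, 2]))  # [1, 1, 2, 2]
```

The frame indices where the aligned label changes give the segment boundaries, which is how CTC segmentation derives word or utterance timestamps from an encoder-only model.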