ESPnet

OWSM CTC v4 1B

An encoder-only, CTC-based open speech foundation model from ESPnet/CMU that reproduces Whisper-style multilingual ASR, speech translation, and language identification using fully public data. Trained on 320k hours of audio (cleaned YODAS data plus earlier OWSM data) across 75 languages.

1B params, Dense

Our Take

Best for: Open-source ASR workloads

A solid 1B-parameter dense audio model from ESPnet. Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 1B
Architecture: Dense
Provider: ESPnet
Download Size: 4.0 GB

Community

Monthly Downloads: 12.7K
Likes: 8
Last Updated: 1 month ago

License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER: 7.4%
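
For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below illustrates that standard definition only; the page does not say which test set or text normalization produced the 7.4% figure, and the function name and example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences: one substitution (sat -> sit) and one deletion (the).
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")
```

Real evaluations usually normalize casing and punctuation before scoring, which can shift the number noticeably.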

Overall Score: 59.3 BB

Benchmark (40%): 85.2
Popularity (25%): 28.0
Efficiency (25%): 48.9
Versatility (10%): 60.0
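
The percentages next to each component read like weights. As a quick check, the sketch below assumes the overall score is a plain weighted average of the four components; the page does not document the scoring formula, so this is an inference from the numbers rather than the site's stated method.

```python
# Component scores and weights exactly as listed on this page.
components = {
    "Benchmark": (85.2, 0.40),
    "Popularity": (28.0, 0.25),
    "Efficiency": (48.9, 0.25),
    "Versatility": (60.0, 0.10),
}


def weighted_score(parts: dict) -> float:
    """Weighted average of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in parts.values())
    weighted_sum = sum(score * weight for score, weight in parts.values())
    return weighted_sum / total_weight


overall = weighted_score(components)
print(f"{overall:.1f}")
```

Under that assumption the result rounds to the listed overall score of 59.3, with popularity pulling the total down the most.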

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices (first 25 listed)

ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.1 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.1 GB
AMD Instinct MI300X	AMD	SS	1.1 GB
AMD Instinct MI325X	AMD	SS	1.1 GB
AMD Instinct MI355X	AMD	SS	1.1 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.1 GB
AMD Radeon RX 7700 XT	AMD	SS	1.1 GB
AMD Radeon RX 7800 XT	AMD	SS	1.1 GB
AMD Radeon RX 7900 XT	AMD	SS	1.1 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.1 GB
AMD Radeon RX 9070	AMD	SS	1.1 GB
AMD Radeon RX 9070 XT	AMD	SS	1.1 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.1 GB
Apple M4	Apple	SS	1.1 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.1 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.1 GB
Apple M5	Apple	SS	1.1 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.1 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.1 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.1 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.1 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.1 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.1 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.1 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.1 GB

About This Model

OWSM (Open Whisper-style Speech Model) is a community effort led by CMU's WAVLab and the ESPnet team to build fully reproducible, openly trained alternatives to OpenAI Whisper. The v4 CTC variant is an encoder-only model using hierarchical multi-task self-conditioned CTC with an E-Branchformer encoder.

Architecture

~1B parameter E-Branchformer encoder with CTC heads (no autoregressive decoder)
128 Mel filterbanks, 8× subsampling → 80 ms frame resolution
Trained on 16 kHz audio in fixed 30 s chunks
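
The figures in this list fit together arithmetically. The sketch below works through them, assuming the Mel features are computed with a 10 ms hop (implied by 8× subsampling giving 80 ms, but not stated on this page); exact frame counts in a real front end can differ by a frame or two depending on windowing and padding.

```python
SAMPLE_RATE_HZ = 16_000   # from the list above
CHUNK_SECONDS = 30        # fixed chunk length from the list above
SUBSAMPLING = 8           # 8x subsampling from the list above
ENCODER_FRAME_MS = 80     # resulting frame resolution from the list above

# Feature hop implied by the two numbers above (an assumption, see lead-in).
hop_ms = ENCODER_FRAME_MS / SUBSAMPLING
hop_samples = int(SAMPLE_RATE_HZ * hop_ms / 1000)

chunk_samples = SAMPLE_RATE_HZ * CHUNK_SECONDS
feature_frames = chunk_samples // hop_samples       # Mel frames per chunk
encoder_frames = feature_frames // SUBSAMPLING      # CTC output frames per chunk

print(hop_ms, hop_samples, chunk_samples, feature_frames, encoder_frames)
```

So one 30 s chunk yields roughly 375 encoder frames, and each CTC prediction covers about 80 ms of audio, which also bounds how fine-grained any CTC-derived timestamps can be.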

What makes it distinctive

Fully open: weights, training data (curated YODAS), code, and logs are all publicly released
Faster than AED models: several× faster inference than Whisper/Canary thanks to CTC greedy decoding
Supports ASR, any-to-any speech translation, language identification, and utterance-level timestamps
Won Best Student Paper at INTERSPEECH 2025 (OWSM v4)
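
The speed claim above rests on how CTC greedy decoding works: take the best token per frame, merge consecutive repeats, and drop blanks, with no token-by-token decoder loop. The following is a minimal, generic sketch of that rule on a toy three-symbol vocabulary; the vocabulary, scores, and blank index are invented for illustration, and it does not reproduce the model's hierarchical self-conditioned CTC or its tokenizer.

```python
from itertools import groupby

BLANK_ID = 0  # assumed index of the CTC blank symbol in this toy vocabulary


def ctc_greedy_decode(frame_scores, blank_id=BLANK_ID):
    """Argmax per frame, collapse consecutive repeats, then remove blanks."""
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    collapsed = [token for token, _ in groupby(best_path)]
    return [token for token in collapsed if token != blank_id]


# Toy vocabulary and per-frame scores (hypothetical values).
vocab = ["<blank>", "h", "i"]
frame_scores = [
    [0.10, 0.80, 0.10],  # h
    [0.10, 0.70, 0.20],  # h again, merged with the previous frame
    [0.90, 0.05, 0.05],  # blank
    [0.20, 0.10, 0.70],  # i
    [0.10, 0.20, 0.70],  # i again, merged
]

tokens = ctc_greedy_decode(frame_scores)
text = "".join(vocab[t] for t in tokens)
print(text)
```

Because every frame is scored in a single encoder pass, decoding cost is dominated by that pass rather than by sequential generation. A blank between two identical tokens keeps them separate, which is how CTC can emit doubled letters.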

Use cases

Suited to reproducible research, multilingual transcription and translation across 75 languages, forced alignment (CTC segmentation), and use as a base for further fine-tuning where training transparency is required.
