Cohere Transcribe (03-2026)

Cohere Labs' first open-source voice model: a 2B-parameter dedicated ASR transformer that took the #1 spot on the Hugging Face Open ASR Leaderboard (5.42 average WER) at release, with support for 14 enterprise languages.


Model Specifications

Parameters: 2B
Architecture: Dense
Provider: Cohere
Download Size: 4.1 GB

Community

Monthly Downloads: 299.0K
Likes: 908
Last Updated: 3 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

WER: 5.4%
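
For reference, word error rate is conventionally the number of word substitutions, deletions, and insertions divided by the number of reference words. The sketch below computes it with a plain word-level edit distance over whitespace-split tokens. The function name `wer` and the sample sentences are illustrative, and the leaderboard's own text normalization is not reproduced here.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences, not benchmark data.
ref = "the quick brown fox jumps over the lazy dog"
hyp = "the quick brown fox jumped over lazy dog"
print(f"WER = {wer(ref, hyp):.1%}")  # one substitution + one deletion over 9 words
```

A WER of 5.4% therefore corresponds to roughly one word error per 18 to 19 reference words.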

Overall Score: 69.7 (B)

Benchmark (40% weight): 89.2
Popularity (25% weight): 81.3
Efficiency (25% weight): 26.7
Versatility (10% weight): 70.0
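
The four component scores and their weights appear to combine as a plain weighted average, which reproduces the 69.7 overall figure. The page does not state the formula, so the check below is an inference from the listed numbers rather than a documented scoring method; `weighted_score` is an illustrative name.

```python
# Component scores and weights exactly as listed on this page.
components = {
    "Benchmark": (89.2, 0.40),
    "Popularity": (81.3, 0.25),
    "Efficiency": (26.7, 0.25),
    "Versatility": (70.0, 0.10),
}


def weighted_score(parts):
    """Weighted average of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in parts.values())
    return sum(score * weight for score, weight in parts.values()) / total_weight


overall = weighted_score(components)
print(round(overall, 1))  # 69.7
```

The low Efficiency score (26.7) is what pulls the overall figure well below the Benchmark score.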

Hardware Compatibility

See which devices can run this model and at what quality level.


83 devices (first 25 shown)


Device	Vendor	Tier	Size
Acer Veriton GN100 AI Mini	Acer	S	1.7 GB
AMD Instinct MI300X	AMD	S	1.7 GB
AMD Instinct MI325X	AMD	S	1.7 GB
AMD Instinct MI355X	AMD	S	1.7 GB
AMD Radeon RX 7600 8GB	AMD	S	1.7 GB
AMD Radeon RX 7700 XT	AMD	S	1.7 GB
AMD Radeon RX 7800 XT	AMD	S	1.7 GB
AMD Radeon RX 7900 XT	AMD	S	1.7 GB
AMD Radeon RX 7900 XTX	AMD	S	1.7 GB
AMD Radeon RX 9070	AMD	S	1.7 GB
AMD Radeon RX 9070 XT	AMD	S	1.7 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	S	1.7 GB
Apple M4	Apple	S	1.7 GB
Apple M4 Max (40-core GPU)	Apple	S	1.7 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	S	1.7 GB
Apple M5	Apple	S	1.7 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	S	1.7 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	S	1.7 GB
Apple Mac Mini (M1, 2020)	Apple	S	1.7 GB
Apple Mac Mini (M2, 2023)	Apple	S	1.7 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	S	1.7 GB
Apple Mac Mini (M4, 2024)	Apple	S	1.7 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	S	1.7 GB
Apple Mac Studio (M1 Max, 2022)	Apple	S	1.7 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	S	1.7 GB


About This Model

Overview

cohere-transcribe-03-2026 is Cohere Labs' first audio model, a dedicated 2B-parameter audio-in / text-out automatic speech recognition model trained from scratch (no Whisper distillation) with supervised cross-entropy.

Architecture

Encoder-decoder cross-attention transformer with a Fast-Conformer encoder (holding the majority of the 2B parameters) and a lightweight Transformer decoder.
Input: raw waveform → log-mel spectrogram (auto-resampled to 16 kHz, multi-channel averaged to mono).
16k multilingual BPE tokenizer with byte fallback.
Customizable punctuation via prompt; automatic long-form chunking.
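
The first two input steps (mono downmix and resampling to 16 kHz) can be sketched in plain Python as below. This is an illustration only: the linear-interpolation resampler is a naive stand-in with no anti-aliasing filter, not the model's actual resampler, and the function names are hypothetical. The log-mel step is left out because the mel parameters (window size, hop, number of bins) are not given here.

```python
TARGET_RATE = 16_000  # Hz, the input rate stated above


def downmix_to_mono(channels):
    """Average equal-length channels sample by sample."""
    n = len(channels)
    return [sum(frame) / n for frame in zip(*channels)]


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Naive linear-interpolation resampler (no low-pass filtering)."""
    if src_rate == dst_rate or len(samples) < 2:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # source samples advanced per output sample
    out = []
    for i in range(out_len):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# Toy stereo signal at 48 kHz: a ramp on the left channel, silence on the right.
left = [i / 48 for i in range(48)]
right = [0.0] * 48
mono = downmix_to_mono([left, right])
audio_16k = resample_linear(mono, src_rate=48_000)
print(len(mono), "->", len(audio_16k), "samples")
```

A production pipeline would apply a low-pass filter before downsampling to avoid aliasing.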

Training Data

~0.5M hours of curated audio-transcript pairs + synthetic augmentations.
Non-speech background-noise augmentation (0–30 dB SNR).
Proprietary mix-balancing and audio decontamination checks for test/train overlap.
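
Noise augmentation at a target SNR is usually done by scaling the noise so that the speech-to-noise power ratio hits the chosen value, then adding it to the clean signal. The sketch below shows that standard calculation with a target drawn from the 0 to 30 dB range quoted above. It is not Cohere's pipeline: the synthetic tone stands in for speech, and `power` and `mix_at_snr` are illustrative names.

```python
import math
import random


def power(signal):
    """Mean squared amplitude."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_scaled_noise) == snr_db, then add."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


rng = random.Random(0)
# One second at 16 kHz: a 220 Hz tone as placeholder "speech", Gaussian noise as background.
speech = [math.sin(2 * math.pi * 220 * t / 16_000) for t in range(16_000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16_000)]

target_db = rng.uniform(0.0, 30.0)  # sample an SNR from the stated range
mixed = mix_at_snr(speech, noise, target_db)

# Verify: the added component should sit exactly target_db below the speech power.
residual = [m - s for m, s in zip(mixed, speech)]
achieved_db = 10 * math.log10(power(speech) / power(residual))
print(f"target {target_db:.2f} dB, achieved {achieved_db:.2f} dB")
```

At 0 dB the noise carries as much power as the speech; at 30 dB it carries one thousandth of it.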

Supported Languages (14)

English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Arabic, Vietnamese, Chinese (Mandarin), Japanese, Korean. There is no automatic language detection; the language must be specified explicitly.

Performance

#1 on the Hugging Face Open ASR Leaderboard as of 2026-03-26 with an average WER of 5.42 across 8 English benchmarks.
~525× real-time (RTFx ≈ 524.88), roughly 3× faster than comparably sized ASR models.
61% average human preference win rate vs. competing ASR models.
Ranks 2nd among open-source models on the multilingual ASR leaderboard.
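
To put the throughput figure in concrete terms: RTFx is audio duration divided by processing time, so processing time is duration divided by RTFx. At the quoted 524.88, an hour of audio works out to roughly 6.9 seconds. The quick calculation below assumes that rate holds, which in practice depends on hardware and batching; `processing_seconds` is an illustrative name.

```python
RTFX = 524.88  # real-time factor quoted above


def processing_seconds(audio_seconds, rtfx=RTFX):
    """RTFx = audio duration / processing time, so time = duration / RTFx."""
    return audio_seconds / rtfx


for label, secs in [("1 minute", 60), ("1 hour", 3600), ("24 hours", 86400)]:
    print(f"{label:>8} of audio -> {processing_seconds(secs):.2f} s")
```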

Deployment

Supported natively in transformers (CohereAsrForConditionalGeneration) and vLLM (/v1/audio/transcriptions), and runs on Apple Silicon, in the browser, and on mobile. 18 quantized variants are available on the Hub.
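
As a rough illustration of the vLLM route, the sketch below builds a multipart POST for the /v1/audio/transcriptions endpoint using only the standard library. Several things here are assumptions rather than facts from this page: the server address, the served model name, the "en" language code, and the form field names (`file`, `model`, `language`), which follow the OpenAI-style transcription API that this endpoint path mirrors. The language field is included because the model has no automatic language detection.

```python
import urllib.request
import uuid


def build_transcription_request(
    audio_bytes,
    filename,
    language,
    model="cohere-transcribe-03-2026",  # assumed served model name
    base_url="http://localhost:8000",   # assumed local vLLM server
):
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain text fields.
    for name, value in (("model", model), ("language", language)):
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # The audio file itself.
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + audio_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        base_url + "/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes stand in for a real audio file.
req = build_transcription_request(b"RIFF placeholder", "sample.wav", language="en")
print(req.get_method(), req.full_url)
# To send against a running server: urllib.request.urlopen(req).read()
```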

