
Cohere Labs' first open-source voice model: a 2B-parameter dedicated ASR transformer that took the #1 spot on the Hugging Face Open ASR Leaderboard (5.42 average WER) at release, with support for 14 enterprise languages.
A strong 2B-parameter dense audio model from Cohere. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
NVIDIA GeForce RTX 5070 TiVast.ai · Spot · 16 GB VRAM | $0.11 |
NVIDIA GeForce RTX 3070RunPod · Community · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 3070RunPod · Spot · 8 GB VRAM | $0.13 |
NVIDIA GeForce RTX 5090Vast.ai · Spot · 32 GB VRAM | $0.13 |
NVIDIA GeForce RTX 4090Vast.ai · Spot · 24 GB VRAM | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Cohere Transcribe (03-2026) is Cohere Labs’ first open-source automatic speech recognition (ASR) model — a dedicated 2B-parameter dense transformer built from scratch for high-accuracy audio-to-text transcription. At release, it claimed the #1 spot on the Hugging Face Open ASR Leaderboard with a mean Word Error Rate (WER) of 5.42, outperforming both general-purpose speech models and dedicated ASR systems in its size class. The model is licensed under Apache 2.0, making it freely available for local deployment, fine-tuning, and commercial use.
This model targets practitioners who need reliable, on-premise transcription without cloud dependencies. It competes with other small-footprint ASR models like Whisper small (244M params) and Whisper medium (769M), as well as larger alternatives like Whisper large-v3 (1.5B) or Meta’s MMS. Cohere Transcribe differentiates itself with a higher parameter count for its size, dense architecture (no MoE routing overhead), and a specific focus on 14 enterprise languages. For developers running a local AI model with 2B parameters in 2026, this represents the current state of the art for on-device speech recognition.
The model uses a dense transformer architecture with exactly 2 billion parameters. Because it is dense — not mixture-of-experts — all parameters are active during both training and inference. This means the full 2B weights must be loaded into VRAM, and inference latency scales linearly with model size. The tradeoff is consistent, predictable performance: there is no routing logic or expert selection to cause variance in speed or memory usage.
Context length is not officially specified, but as an ASR model processing audio waveforms (sampled at 16 kHz typical), the effective context is tied to the maximum input duration. The model accepts audio files up to 25 MB (per Cohere’s documentation), which translates to roughly 30–40 minutes of speech at standard bitrates. Output is pure text; the model is text-only on the decode side. It does not produce timestamps or speaker diarization out of the box — those tasks require post-processing.
The model was trained from scratch on a multilingual dataset covering 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Vietnamese, Chinese (Mandarin), Arabic, Japanese, and Korean. The architecture is a standard encoder-decoder transformer with attention mechanisms optimized for speech features, but Cohere has not published detailed layer counts or hidden dimensions. Practitioners should treat it as a black box with known inputs (16 kHz mono WAV, 25 MB max) and known outputs (plain text with punctuation).
Cohere Transcribe is a pure ASR model — it does not generate summaries, answer questions, or perform translation. Its single capability is converting spoken audio into written text with high accuracy. The model shines in real-world conversational environments, including background noise, multiple speakers, and varied accents, according to Cohere’s internal benchmarks and the Open ASR Leaderboard results.
Specific use cases:
The model’s WER of 5.42 on the leaderboard is an average across several benchmark datasets (LibriSpeech, Common Voice, etc.) and languages. While not perfect (human parity is around 4% WER on clean English), it’s competitive with proprietary cloud ASR services for many practical scenarios.
Because this is an Apache-2.0 model released via Hugging Face, it can be run locally with standard transformers or whisper-type pipelines. The 2B parameter count means VRAM requirements are manageable on consumer hardware, but you will need a GPU with at least 6 GB of VRAM for FP32 inference. Quantization is highly recommended.
| Quantization | VRAM Required (approx.) | Tradeoff |
|---|---|---|
| FP32 | 8 GB | Full accuracy; most demanding |
| FP16 | 4 GB | Minimal quality loss; best for most setups |
| Q8 (8-bit) | 2–3 GB | Slight WER increase; good for older GPUs |
| Q4_K_M (4-bit) | ~1.5 GB | Acceptable for non-critical tasks; higher WER |
For most users, FP16 strikes the best balance. On an RTX 4090 (24 GB VRAM), you can process audio at roughly 200–300 seconds of audio per second of wall time (real-time factor <0.01). On an M4 Max (64 GB unified memory), FP16 inference runs at similar speeds with virtually no VRAM pressure. Even an RTX 3060 (12 GB) can handle FP16 without issues.
transformers with MPS backend; expect ~100 tokens per second for FP16.The quickest way is via the Hugging Face transformers library and the AutoModelForSpeechSeq2Seq pipeline:
1from transformers import pipeline23pipe = pipeline("automatic-speech-recognition",4 model="CohereLabs/cohere-transcribe-03-2026",5 device=0) # GPU6result = pipe("audio.wav")7print(result["text"])
For Ollama users, there is no direct Ollama support yet (ASR models are not part of Ollama’s default model library), but you can wrap the pipeline in a custom API. Expect community scripts to appear quickly given the model’s popularity (over 570k downloads).
Compared to Whisper small (244M) and Whisper medium (769M), Cohere Transcribe is significantly larger (2B vs. 244M), which shows in accuracy: Whisper small averages ~9% WER on English test sets, while Cohere hits ~5.4% across languages. However, Whisper models are multimodal (can produce English translation for any input language) and support 100+ languages. Cohere only handles 14 languages and does not offer translation.
Compared to Whisper large-v3 (1.5B), Cohere Transcribe still leads on the Open ASR Leaderboard average WER (5.42 vs. ~6.0 for Whisper large-v3). But Whisper large-v3 covers many more languages and has better robustness to long audio (30-second context vs. Cohere’s larger effective context). In practice, Whisper large-v3 requires about 1.5 GB less VRAM at FP16, so it’s a viable alternative if you need broader language support or slightly lower memory.
When to choose Cohere Transcribe: You need the highest accuracy in one of the 14 supported languages, want an Apache-2.0 license with no restrictions, and run on mid-range hardware (8–12 GB VRAM). Its real-time factor advantage also makes it attractive for live transcription pipelines.
When to choose a Whisper variant: You need coverage for languages outside the 14, require translation to English, or run on extremely memory-constrained devices (e.g., 4 GB VRAM) where even FP16 Whisper medium fits. Also, Whisper has a more mature ecosystem (e.g., faster-whisper for CPU optimizations).
For local AI model inference on a consumer GPU, Cohere Transcribe delivers the best accuracy-per-parameter ratio in the ASR category, making it the default choice for English-heavy multilingual enterprise workloads in 2026.

Explore the Provider
Aggregate stats, leaderboard, release timeline, and benchmark coverage across every Cohere model we track.