Cohere

Cohere Transcribe (03-2026)

Cohere Labs' first open-source voice model: a 2B-parameter dedicated ASR transformer that took the #1 spot on the Hugging Face Open ASR Leaderboard (5.42 average WER) at release, with support for 14 enterprise languages.

2B params · Dense

Our Take

Best for: Open-source ASR workloads

A strong 2B-parameter dense audio model from Cohere. Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 2B

Architecture: Dense

Provider: Cohere

Download Size: 4.1 GB

Community

Monthly Downloads: 740.7K

Likes: 1.0K

Last Updated: 16 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

Apache 2.0

Performance & Scoring

Benchmarks

WER: 5.4%

MBA Open Score: 70.9 (AA)

Component	Weight	Score
Benchmark	40%	89.2
Popularity	25%	87.0
Efficiency	25%	26.1
Versatility	10%	70.0
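
The four component scores and their weights can be combined to check the composite. This is a minimal sketch that assumes the MBA Open Score is a plain weighted sum. The page does not state its formula, and the names `COMPONENTS` and `composite_score` are hypothetical.

```python
# Weighted composite from the component scores listed on this page.
# Assumption: the composite is a simple weighted sum (not confirmed by the site).
COMPONENTS = {
    "Benchmark": (0.40, 89.2),
    "Popularity": (0.25, 87.0),
    "Efficiency": (0.25, 26.1),
    "Versatility": (0.10, 70.0),
}


def composite_score(components):
    """Return the weighted sum of (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weight * score for weight, score in components.values())


if __name__ == "__main__":
    print(f"{composite_score(COMPONENTS):.3f}")
```

Under that assumption the sum comes to about 70.955, which is consistent with the displayed 70.9 if the site truncates rather than rounds. The low Efficiency component is what pulls the composite well below the Benchmark score.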

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.7 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.7 GB
AMD Instinct MI300X	AMD	SS	1.7 GB
AMD Instinct MI325X	AMD	SS	1.7 GB
AMD Instinct MI355X	AMD	SS	1.7 GB
AMD Radeon RX 7600 8GB	AMD	SS	1.7 GB
AMD Radeon RX 7700 XT	AMD	SS	1.7 GB
AMD Radeon RX 7800 XT	AMD	SS	1.7 GB
AMD Radeon RX 7900 XT	AMD	SS	1.7 GB
AMD Radeon RX 7900 XTX	AMD	SS	1.7 GB
AMD Radeon RX 9070	AMD	SS	1.7 GB
AMD Radeon RX 9070 XT	AMD	SS	1.7 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	1.7 GB
Apple M4	Apple	SS	1.7 GB
Apple M4 Max (40-core GPU)	Apple	SS	1.7 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	1.7 GB
Apple M5	Apple	SS	1.7 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	1.7 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	1.7 GB
Apple Mac Mini (M1, 2020)	Apple	SS	1.7 GB
Apple Mac Mini (M2, 2023)	Apple	SS	1.7 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	1.7 GB
Apple Mac Mini (M4, 2024)	Apple	SS	1.7 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	1.7 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	1.7 GB

Rent in the Cloud

Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
NVIDIA L4	Vast.ai	Spot	24 GB	$0.03
NVIDIA L4	Vast.ai	On-Demand	24 GB	$0.04
NVIDIA GeForce RTX 5060 Ti	Vast.ai	Spot	16 GB	$0.09
NVIDIA GeForce RTX 5060 Ti	Vast.ai	On-Demand	16 GB	$0.10
NVIDIA GeForce RTX 5070 Ti	Vast.ai	Spot	16 GB	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.

About This Model

Overview

Cohere Transcribe (03-2026) is Cohere Labs’ first open-source automatic speech recognition (ASR) model — a dedicated 2B-parameter dense transformer built from scratch for high-accuracy audio-to-text transcription. At release, it claimed the #1 spot on the Hugging Face Open ASR Leaderboard with a mean Word Error Rate (WER) of 5.42, outperforming both general-purpose speech models and dedicated ASR systems in its size class. The model is licensed under Apache 2.0, making it freely available for local deployment, fine-tuning, and commercial use.

This model targets practitioners who need reliable, on-premise transcription without cloud dependencies. It competes with smaller ASR models such as Whisper small (244M params) and Whisper medium (769M), as well as similarly sized alternatives such as Whisper large-v3 (1.5B) and Meta’s MMS. Cohere Transcribe differentiates itself with a larger parameter count than any Whisper variant, a dense architecture (no MoE routing overhead), and a specific focus on 14 enterprise languages. For developers running a local 2B-parameter model in 2026, its leaderboard position at release makes it a leading option for on-device speech recognition.

Architecture & Technical Details

The model uses a dense transformer architecture with roughly 2 billion parameters. Because it is dense rather than mixture-of-experts, all parameters are active during both training and inference. The full set of weights must therefore be loaded into VRAM, and compute per inference step grows in proportion to parameter count. The tradeoff is consistent, predictable performance: there is no routing logic or expert selection to cause variance in speed or memory usage.

Context length is not officially specified, but because an ASR model processes audio waveforms (typically sampled at 16 kHz), the effective context is tied to the maximum input duration. The model accepts audio files up to 25 MB (per Cohere’s documentation). How much speech that holds depends on the encoding: roughly 30–40 minutes for compressed audio at around 100 kbps, but only about 13 minutes for uncompressed 16-bit, 16 kHz mono WAV (32 KB per second). Output is plain text only. The model does not produce timestamps or speaker diarization out of the box; those tasks require post-processing.
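
How much speech fits under the 25 MB limit depends on the encoding, so it helps to work the arithmetic out. The sketch below assumes 16-bit PCM for WAV, a hypothetical 100 kbps compressed stream, and a decimal megabyte. The helper names are illustrative and not part of any Cohere tooling.

```python
# Rough upper bound on audio duration for a given file-size limit.
# Assumptions: 1 MB = 1,000,000 bytes; container/header overhead is ignored.
LIMIT_BYTES = 25 * 1000 * 1000


def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels):
    """Data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def max_duration_minutes(size_bytes, bytes_per_second):
    """Longest audio (in minutes) that fits in size_bytes at the given data rate."""
    return size_bytes / bytes_per_second / 60


# 16 kHz, 16-bit, mono WAV: 32,000 bytes per second.
wav_rate = pcm_bytes_per_second(16_000, 16, 1)
wav_minutes = max_duration_minutes(LIMIT_BYTES, wav_rate)

# Hypothetical compressed stream at 100 kbps: 12,500 bytes per second.
compressed_rate = 100_000 / 8
compressed_minutes = max_duration_minutes(LIMIT_BYTES, compressed_rate)

if __name__ == "__main__":
    print(f"WAV (16 kHz, 16-bit, mono): {wav_minutes:.1f} min")
    print(f"Compressed (~100 kbps):     {compressed_minutes:.1f} min")
```

For recordings longer than the limit allows, splitting the audio into segments before transcription is the usual workaround.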

The model was trained from scratch on a multilingual dataset covering 14 languages: English, German, French, Italian, Spanish, Portuguese, Greek, Dutch, Polish, Vietnamese, Chinese (Mandarin), Arabic, Japanese, and Korean. The architecture is a standard encoder-decoder transformer with attention mechanisms optimized for speech features, but Cohere has not published detailed layer counts or hidden dimensions. Practitioners should treat it as a black box with known inputs (16 kHz mono WAV, 25 MB max) and known outputs (plain text with punctuation).

Capabilities & Use Cases

Cohere Transcribe is a pure ASR model — it does not generate summaries, answer questions, or perform translation. Its single capability is converting spoken audio into written text with high accuracy. The model shines in real-world conversational environments, including background noise, multiple speakers, and varied accents, according to Cohere’s internal benchmarks and the Open ASR Leaderboard results.

Specific use cases:

Enterprise meeting transcription: Transcribe boardroom discussions, customer calls, or internal stand-ups with low latency. The 2B size makes it feasible to run on a single GPU for near-real-time processing.
Call center analytics: Convert recorded calls to searchable text for compliance or quality monitoring, without sending data to an external API.
Medical or legal dictation: Accurately transcribe domain-specific vocabulary (the model’s training corpus includes enterprise idioms) in supported languages.
Video captioning: Generate subtitles for pre-recorded or live content, especially in multilingual environments where one of the 14 languages is primary.
Local voice assistants: Run on an edge device (e.g., a laptop with a discrete GPU) to enable wake-word-free voice-to-text for personal automation.

The model’s WER of 5.42 on the leaderboard is an average across several benchmark datasets (LibriSpeech, Common Voice, etc.) and languages. While not perfect (human parity is around 4% WER on clean English), it’s competitive with proprietary cloud ASR services for many practical scenarios.
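
For readers who want to check transcription quality on their own recordings, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch of that standard definition. Benchmark harnesses typically normalize casing and punctuation first, which this example does not do, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from output)
                curr[j - 1] + 1,     # insertion (extra word in output)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 5.42% corresponds to roughly one word error in every 18 reference words, averaged over the benchmark sets.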

Running Cohere Transcribe (03-2026) Locally

Because this is an Apache-2.0 model released via Hugging Face, it can be run locally with the standard transformers library or Whisper-style pipelines. The 2B parameter count keeps VRAM requirements manageable on consumer hardware, but FP32 inference needs a GPU with at least 8 GB of VRAM (2B parameters at 4 bytes each). FP16 or quantization is highly recommended.

VRAM and Quantization

Quantization	VRAM Required (approx.)	Tradeoff
FP32	8 GB	Full accuracy; most demanding
FP16	4 GB	Minimal quality loss; best for most setups
Q8 (8-bit)	2–3 GB	Slight WER increase; good for older GPUs
Q4_K_M (4-bit)	~1.5 GB	Acceptable for non-critical tasks; higher WER
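
The figures in the table follow mostly from parameter count times bytes per parameter. This is a rough sketch of that estimate, assuming exactly 2 billion parameters and counting weights only. Activations, audio buffers, and framework overhead come on top, and real 4-bit formats such as Q4_K_M store slightly more than 4 bits per weight, which is why the table lists a higher figure there.

```python
# Weight-only memory estimate: parameters x bytes per parameter.
# Assumption: nominal bit widths; runtime overhead is not included.
BYTES_PER_PARAM = {
    "FP32": 4.0,
    "FP16": 2.0,
    "Q8": 1.0,
    "Q4": 0.5,
}

NUM_PARAMS = 2_000_000_000  # assumed nominal size of the model


def weight_memory_gb(num_params, precision):
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(NUM_PARAMS, precision):.1f} GB")
```

The FP16 estimate of about 4 GB also lines up with the 4.1 GB download size listed under Model Specifications.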

For most users, FP16 strikes the best balance. On an RTX 4090 (24 GB VRAM), you can process audio at roughly 200–300 seconds of audio per second of wall time (real-time factor <0.01). On an M4 Max (64 GB unified memory), FP16 inference runs at similar speeds with virtually no VRAM pressure. Even an RTX 3060 (12 GB) can handle FP16 without issues.

Hardware Recommendations

Best GPU for Cohere Transcribe (03-2026): Any NVIDIA card with at least 8 GB VRAM (RTX 2070, RTX 3060, RTX 4060, and up). For batch processing, an RTX 4090 or A4000 offers headroom.
Consumer GPU: The model runs comfortably on an RTX 3060 12GB at FP16, achieving about 150–200 tokens per second (where tokens correspond to ~80ms of audio per token). Real-time inference (input audio length processed in less than or equal to its duration) is easily achievable.
Apple Silicon: M1 Pro/Max or M2/M3/M4 chips with 16 GB+ RAM run the model via transformers with MPS backend; expect ~100 tokens per second for FP16.
CPU-only: Not recommended for real-time use; expect 5–10 tokens per second.
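
The token rates above can be converted into a real-time multiple using the stated figure of about 80 ms of audio per token. This is a small sketch of that conversion. The function names are illustrative, and the 80 ms value is taken from the text above rather than from a published model spec.

```python
MS_PER_TOKEN = 80  # from the text above: each token covers ~80 ms of audio


def audio_seconds_per_second(tokens_per_second, ms_per_token=MS_PER_TOKEN):
    """Seconds of audio transcribed per second of wall-clock time."""
    return tokens_per_second * ms_per_token / 1000


def is_real_time(tokens_per_second, ms_per_token=MS_PER_TOKEN):
    """True when audio is processed at least as fast as it plays back."""
    return audio_seconds_per_second(tokens_per_second, ms_per_token) >= 1.0


if __name__ == "__main__":
    for label, rate in [("RTX 3060 (low)", 150), ("RTX 3060 (high)", 200),
                        ("Apple Silicon", 100), ("CPU (low)", 5), ("CPU (high)", 10)]:
        speed = audio_seconds_per_second(rate)
        print(f"{label}: {speed:.1f}x real time, real-time capable: {is_real_time(rate)}")
```

On these numbers the RTX 3060 runs at roughly 12–16× real time and Apple Silicon at about 8×, while the CPU-only range of 5–10 tokens per second works out to 0.4–0.8× and falls short of real time.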

Getting Started Fast

The quickest way is via the Hugging Face transformers library and its automatic-speech-recognition pipeline:

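from transformers import pipeline

# Build an ASR pipeline; device=0 selects the first CUDA GPU (use device=-1 for CPU).
pipe = pipeline("automatic-speech-recognition",
                model="CohereLabs/cohere-transcribe-03-2026",
                device=0)
result = pipe("audio.wav")  # path to a local audio file
print(result["text"])       # transcription as plain text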

There is no direct Ollama support yet (ASR models are not part of Ollama’s default model library), but you can wrap the pipeline in a custom API. Expect community scripts to appear quickly given the model’s popularity (about 740K monthly downloads).

Performance Notes

Real-time factor: Up to 3× faster than competing ASR models, according to Cohere. In practice, a 1-minute audio clip processes in well under 20 seconds on a modern GPU.
Batch processing: If processing multiple files, use a batch size of 4–8 on a 24 GB GPU to maximize throughput.
Tokens per second: Not a standard ASR metric; instead measure “seconds of audio per second of inference.” At FP16 on RTX 4090, expect ~300 audio-seconds per second.
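
To measure “seconds of audio per second of inference” on your own hardware, time a transcription call and divide by the clip length. This is a minimal sketch. `transcribe` stands for any callable that takes a file path and returns text (for example, a thin wrapper around the pipeline shown earlier), and the helper names are hypothetical.

```python
import time


def throughput_stats(audio_duration_s, elapsed_s):
    """Return the real-time factor (RTF) and its inverse, the speedup."""
    if audio_duration_s <= 0 or elapsed_s <= 0:
        raise ValueError("durations must be positive")
    return {
        "rtf": elapsed_s / audio_duration_s,      # below 1.0 means faster than real time
        "speedup": audio_duration_s / elapsed_s,  # audio-seconds per wall-clock second
    }


def measure_throughput(transcribe, audio_path, audio_duration_s):
    """Time one call to transcribe(audio_path) and report throughput."""
    start = time.perf_counter()
    text = transcribe(audio_path)
    elapsed_s = time.perf_counter() - start
    stats = throughput_stats(audio_duration_s, elapsed_s)
    stats["text"] = text
    stats["elapsed_s"] = elapsed_s
    return stats


if __name__ == "__main__":
    # A 60-second clip processed in 0.2 seconds is a 300x speedup (RTF of about 0.0033).
    print(throughput_stats(60.0, 0.2))
```

Run a warm-up call first and average over several files, since the first call usually includes model loading and kernel compilation.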

How It Compares

Compared to Whisper small (244M) and Whisper medium (769M), Cohere Transcribe is significantly larger (2B vs. 244M), which shows in accuracy: Whisper small averages ~9% WER on English test sets, while Cohere hits ~5.4% across languages. However, Whisper models are multitask (they can translate speech in any supported language into English) and support roughly 100 languages. Cohere only handles 14 languages and does not offer translation.

Compared to Whisper large-v3 (1.5B), Cohere Transcribe still leads on the Open ASR Leaderboard average WER (5.42 vs. ~6.0 for Whisper large-v3). But Whisper large-v3 covers many more languages and has well-established handling for long audio, which it processes in 30-second windows. With about 0.5B fewer parameters, Whisper large-v3 needs roughly 1 GB less VRAM for weights at FP16, so it’s a viable alternative if you need broader language support or slightly lower memory.

When to choose Cohere Transcribe: You need the highest accuracy in one of the 14 supported languages, want a permissive Apache-2.0 license, and run on mid-range hardware (8–12 GB VRAM). Its real-time factor advantage also makes it attractive for live transcription pipelines.

When to choose a Whisper variant: You need coverage for languages outside the 14, require translation to English, or run on extremely memory-constrained devices (e.g., 4 GB VRAM) where FP16 Whisper medium fits comfortably but FP16 Cohere Transcribe does not. Also, Whisper has a more mature ecosystem (e.g., faster-whisper for CPU optimizations).

For local inference on a consumer GPU, Cohere Transcribe offers leaderboard-leading accuracy at a 2B-parameter footprint, making it a strong default for English-heavy multilingual enterprise workloads in 2026.
