Kyutai

Kyutai STT 2.6B EN

Kyutai's 2.6B-parameter English-only streaming speech-to-text model, built on the multistream Moshi architecture. It reports a 6.4% WER on the Open ASR Leaderboard, state of the art for a streaming model, while operating with a 2.5 s delay.

2.6B params · Dense

Our Take

Best for: Open-source ASR workloads

A workable 2.6B-parameter dense audio model from Kyutai. Treat the modality benchmarks on this page as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 2.6B
Architecture: Dense
Provider: Kyutai
Download Size: 16.8 GB

Community

Likes: 123
Last Updated: 1 year ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER: 6.4%

MBA Open Score: 51.2 CC

Component	Weight	Score
Benchmark	40%	87.2
Popularity	25%	17.8
Efficiency	25%	17.4
Versatility	10%	75.0
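The four component scores and their weights reproduce the headline figure if the MBA Open Score is a plain weighted sum. The page does not state the formula, so that is an assumption; the sketch below only checks that the numbers are consistent with it.

```python
# Component scores and weights as listed on this page.
COMPONENTS = {
    "benchmark": (87.2, 0.40),
    "popularity": (17.8, 0.25),
    "efficiency": (17.4, 0.25),
    "versatility": (75.0, 0.10),
}


def composite_score(components):
    """Weighted sum of (score, weight) pairs.

    Assumption: the composite is a simple weighted average. The page does
    not document the actual formula.
    """
    total_weight = sum(weight for _, weight in components.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"weights sum to {total_weight}, expected 1.0")
    return sum(score * weight for score, weight in components.values())


score = composite_score(COMPONENTS)
print(f"{score:.1f}")  # compare with the 51.2 shown above
```

The low Popularity and Efficiency components pull the composite well below the 87.2 benchmark component, which is why the headline score understates the model's transcription accuracy.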

Hardware Compatibility

See which devices can run this model and at what quality level.

102 devices


ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	2.1 GB
Acer Veriton GN100 AI Mini	Acer	SS	2.1 GB
AMD Instinct MI300X	AMD	SS	2.1 GB
AMD Instinct MI325X	AMD	SS	2.1 GB
AMD Instinct MI355X	AMD	SS	2.1 GB
AMD Radeon RX 7600 8GB	AMD	SS	2.1 GB
AMD Radeon RX 7700 XT	AMD	SS	2.1 GB
AMD Radeon RX 7800 XT	AMD	SS	2.1 GB
AMD Radeon RX 7900 XT	AMD	SS	2.1 GB
AMD Radeon RX 7900 XTX	AMD	SS	2.1 GB
AMD Radeon RX 9070	AMD	SS	2.1 GB
AMD Radeon RX 9070 XT	AMD	SS	2.1 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	2.1 GB
Apple M4	Apple	SS	2.1 GB
Apple M4 Max (40-core GPU)	Apple	SS	2.1 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	2.1 GB
Apple M5	Apple	SS	2.1 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	2.1 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	2.1 GB
Apple Mac Mini (M1, 2020)	Apple	SS	2.1 GB
Apple Mac Mini (M2, 2023)	Apple	SS	2.1 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	2.1 GB
Apple Mac Mini (M4, 2024)	Apple	SS	2.1 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	2.1 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	2.1 GB

Showing the first 25 of 102 devices.

Rent in the Cloud

Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
NVIDIA L4	Vast.ai	Spot	24 GB	$0.03
NVIDIA L4	Vast.ai	On-Demand	24 GB	$0.04
NVIDIA GeForce RTX 5060 Ti	Vast.ai	Spot	16 GB	$0.09
NVIDIA GeForce RTX 5060 Ti	Vast.ai	On-Demand	16 GB	$0.10
NVIDIA GeForce RTX 5070 Ti	Vast.ai	Spot	16 GB	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
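One way to plan for restarts is to treat them as extra billed hours and ask how much overhead a spot instance can absorb before on-demand becomes cheaper. The sketch below uses the L4 rates listed above; the 10% restart overhead is a made-up illustration, not a measured figure.

```python
def rental_cost(gpu_hours, rate_per_hour, restart_overhead=0.0):
    """Cost of a job, inflating billed hours by a restart overhead fraction.

    restart_overhead is hypothetical: 0.10 means 10% of the time is lost to
    interruptions and re-initialization.
    """
    if restart_overhead < 0:
        raise ValueError("restart_overhead must be non-negative")
    return gpu_hours * (1.0 + restart_overhead) * rate_per_hour


def break_even_overhead(spot_rate, on_demand_rate):
    """Overhead fraction at which spot and on-demand cost the same."""
    return on_demand_rate / spot_rate - 1.0


spot = rental_cost(100, 0.03, restart_overhead=0.10)
on_demand = rental_cost(100, 0.04)
limit = break_even_overhead(0.03, 0.04)

print(f"spot: ${spot:.2f}, on-demand: ${on_demand:.2f}")
print(f"spot stays cheaper below {limit:.0%} overhead")
```

At these rates, spot remains the cheaper option until roughly a third of the billed time is lost to interruptions.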

About This Model

Overview

Kyutai STT 2.6B EN is a streaming speech-to-text model built by the Paris-based open-science lab Kyutai. At 2.6 billion parameters, it is one of the most accurate streaming ASR models available, achieving a 6.4% word error rate (WER) on the Open ASR Leaderboard while operating with a 2.5-second delay. Unlike traditional offline ASR models that require the entire audio clip before transcribing, Kyutai STT outputs text incrementally as audio streams in. That design makes it practical for voice agents, live captioning, and any application where seeing text within a few seconds matters more than waiting for the full recording.
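For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal implementation; it only lowercases and splits on whitespace, whereas leaderboards typically apply a fuller text normalization, so its numbers will not match published scores exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.1%}")  # one deleted word out of six reference words
```

A WER of 6.4% therefore means roughly one word-level error for every 16 reference words.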

The model is English-only and uses a dense Transformer decoder architecture derived from the multistream framework of Kyutai’s Moshi. It processes audio tokenized by the Mimi neural codec at 12.5 Hz, with each frame represented by 32 audio tokens. The text stream is shifted relative to the audio stream, so each word is predicted only after the model has heard the 2.5 seconds of audio that follow it. This deterministic delay is a trade-off: you get state-of-the-art accuracy for a streaming model, but you accept that every word appears about 2.5 seconds after it is spoken.
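A 12.5 Hz frame rate means each frame covers 80 ms of audio, so a streaming client feeds the model fixed-size chunks of that length. The sketch below shows the chunking step only. The 24 kHz sample rate is an assumption (it is not stated in this article), and `iter_frames` is a hypothetical helper, not part of Kyutai's code.

```python
FRAME_RATE_HZ = 12.5     # from the article
SAMPLE_RATE_HZ = 24_000  # assumption: not stated in this article
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 80 ms of audio


def iter_frames(samples, samples_per_frame=SAMPLES_PER_FRAME):
    """Yield consecutive fixed-size frames, dropping a trailing partial frame."""
    last_start = len(samples) - samples_per_frame
    for start in range(0, last_start + 1, samples_per_frame):
        yield samples[start:start + samples_per_frame]


one_second = [0.0] * SAMPLE_RATE_HZ  # silent placeholder audio
frames = list(iter_frames(one_second))
print(SAMPLES_PER_FRAME, len(frames))
```

In a live setting the trailing partial frame would be kept in a buffer and completed by the next block of microphone samples rather than dropped.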

Licensed under CC-BY-4.0, the weights are freely available for commercial and research use, provided attribution is given. The model competes with large ASR systems such as Whisper (which is inherently offline, though it can be adapted to streaming with segmentation) and with real-time models like NVIDIA Parakeet or Google’s streaming models. Kyutai STT 2.6B EN distinguishes itself through its native streaming architecture, batching efficiency, and the ability to run on consumer GPUs.

Architecture & Technical Details

Kyutai STT 2.6B EN is a dense, decoder-only Transformer, not a mixture-of-experts (MoE). All 2.6 billion parameters are active during every forward pass. For inference, this translates to predictable VRAM usage and consistent runtime, but it also means more compute per step than an MoE model with the same total parameter count, where only a fraction of the parameters are used per token. Weight memory would be about the same in both cases, since an MoE model still has to keep all of its experts loaded.

The audio frontend uses Kyutai’s Mimi codec to convert raw audio into discrete tokens at 12.5 frames per second. Each frame is represented by 32 audio tokens, yielding a total of 400 audio tokens per second. These tokens are fed into the Transformer, which predicts a stream of text tokens. The text stream is offset by 2.5 seconds from the audio stream, which at 12.5 Hz is about 31 frames (2.5 × 12.5 = 31.25). This delay is the key to the model’s streaming behavior: it can only begin outputting text once the first 2.5 seconds of audio have been consumed.

Context length is not officially specified, but the model has been demonstrated to handle audio segments up to two hours in length without degradation. This suggests the positional encoding (likely relative or RoPE) supports long sequences. At 400 audio tokens per second, two hours of audio amounts to approximately 2.88 million audio tokens, or 90,000 frames at 12.5 Hz. If, as in Moshi’s multistream design, the 32 tokens of a frame share a single time step, the sequence the Transformer attends over is closer to the frame count than to the raw token count.
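The figures in this section all follow from three constants given in the text: the frame rate, the tokens per frame, and the delay. A short sketch of the arithmetic (`stream_stats` is a name introduced here for illustration):

```python
FRAME_RATE_HZ = 12.5   # frames per second of audio
TOKENS_PER_FRAME = 32  # audio tokens per frame
DELAY_SECONDS = 2.5    # offset between the audio and text streams


def stream_stats(duration_seconds: float) -> dict:
    """Frame and token counts for a clip of the given length."""
    frames = duration_seconds * FRAME_RATE_HZ
    return {
        "frames": frames,
        "audio_tokens": frames * TOKENS_PER_FRAME,
        "tokens_per_second": FRAME_RATE_HZ * TOKENS_PER_FRAME,
        "delay_frames": DELAY_SECONDS * FRAME_RATE_HZ,
    }


two_hours = stream_stats(2 * 60 * 60)
for name, value in two_hours.items():
    print(f"{name}: {value:,.2f}")
```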

The model outputs text with proper capitalization and punctuation. Word-level timestamps are recovered by converting the frame index of each predicted token to seconds (frame index ÷ 12.5) and subtracting the 2.5-second offset. This is a straightforward post-processing step that Kyutai’s inference code handles.
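The timestamp recovery can be sketched in a few lines. This illustrates the arithmetic only and is not Kyutai's implementation; the `(word, frame_index)` pairs are invented sample data.

```python
FRAME_RATE_HZ = 12.5
DELAY_SECONDS = 2.5


def token_time_seconds(frame_index: int) -> float:
    """Estimate when the audio behind a text token was spoken.

    The token is emitted at frame_index / FRAME_RATE_HZ seconds into the
    stream; the spoken word lies DELAY_SECONDS earlier. Clamped at zero.
    """
    emitted_at = frame_index / FRAME_RATE_HZ
    return max(0.0, emitted_at - DELAY_SECONDS)


# Hypothetical (word, frame index at which the text token was emitted) pairs.
predicted = [("Hello", 40), ("world", 50)]
timed = [(word, token_time_seconds(frame)) for word, frame in predicted]
for word, seconds in timed:
    print(f"{seconds:5.2f}s  {word}")
```

Because frames are 80 ms long, timestamps recovered this way have a resolution of 80 ms at best.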

The architecture is fully open-source, with training details and checkpoints available on Hugging Face. The model was pretrained on 2.5 million hours of public audio with synthetic transcripts from Whisper, then fine-tuned on smaller, high-quality datasets with ground-truth transcripts.

Capabilities & Use Cases

Kyutai STT 2.6B EN is designed for one thing: streaming English speech recognition. It does not perform speaker diarization, language identification, or emotion detection. What it does, it does well.

Streaming transcription with low latency: The 2.5-second delay means words appear about 2.5 seconds after they were spoken. This is fast enough for live captioning, voice assistants, and real-time meeting transcription. It is not suitable for push-to-talk or instant-feedback scenarios where sub-second latency is required (for those, the 1B variant with 0.5-second delay is a better fit).
Robustness to noise and long audio: The model handles noisy environments (e.g., background chatter, room echo) without significant degradation, according to Kyutai’s benchmarks. It can transcribe audio clips up to two hours long without re-initialization, making it practical for podcast transcription, lecture recording, or call center logs.
Batched inference for high throughput: On an H100 GPU, the model can process 400 concurrent audio streams in real-time. This batching capability makes it viable for server-side deployments where many simultaneous transcription sessions are needed (e.g., a large call center or live event with multiple speakers).
Word-level timestamps: Each word is associated with a precise time offset, useful for aligning transcriptions with video or for downstream NLP tasks like named entity recognition in context.
Punctuation and capitalization: The output is readable and formatted, reducing post-processing overhead.

Real-world use cases include:

Unmute: Kyutai’s own open-source voice AI pipeline that combines STT with a text-to-speech model for real-time voice-based LLM interaction.
Live streaming subtitles for events or broadcasts.
Voice-controlled applications where a low-latency transcript is fed into a reasoning or search loop.
Offline transcription of long audio files (the model can run in non-streaming mode by feeding audio in chunks, though the 2.5-second padding applies).

Running Kyutai STT 2.6B EN Locally

This is where Kyutai STT 2.6B EN becomes interesting for practitioners: it can run on consumer hardware with moderate VRAM, and its streaming architecture means you don’t need a cloud API to get low-latency transcriptions.

VRAM requirements

At full fp16 precision, the model consumes roughly 5.2 GB of VRAM for weights (2.6 billion parameters × 2 bytes), plus an additional 1–2 GB for runtime buffers (audio codec, attention cache, temporary tensors). Allowing for transient allocations, budget about 8–10 GB for peak memory during a streaming forward pass. This means an NVIDIA RTX 3090 or RTX 4090 (24 GB VRAM) runs it comfortably at full precision. An M4 Max with 48 GB unified memory or an M3 Ultra will also run it without issues.

For users with less VRAM, quantization is effective:

Q4_K_M (4-bit quantization): Reduces weight memory to ~1.4 GB. Peak memory drops to ~4–5 GB. This fits on an RTX 3060 (12 GB), RTX 4060 Ti (16 GB), or even an M2 MacBook Air with 16 GB RAM. The WER penalty is expected to be small (under 0.5% absolute is the figure given here), but it is worth measuring on your own audio.
Q8_0 (8-bit quantization): Weights take ~2.8 GB. Recommended if you have at least 8 GB VRAM.
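The weight-memory figures can be sanity-checked from the parameter count alone. The sketch below uses nominal bit widths (16, 8, and 4 bits per weight); real quantized formats store per-block scales on top of that, which is consistent with the slightly higher ~2.8 GB and ~1.4 GB quoted above.

```python
PARAMS = 2.6e9  # parameter count from the model specifications

# Nominal bits per weight. Actual Q8_0 / Q4_K_M files carry extra per-block
# metadata, so treat these as lower bounds rather than exact file sizes.
NOMINAL_BITS = {"fp16": 16, "8-bit": 8, "4-bit": 4}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Weight memory in decimal gigabytes, excluding runtime buffers."""
    return params * bits_per_weight / 8 / 1e9


estimates = {name: weight_gb(PARAMS, bits) for name, bits in NOMINAL_BITS.items()}
for name, gb in estimates.items():
    print(f"{name:>6}: {gb:.1f} GB")
```

These numbers cover weights only; the attention cache and codec buffers come on top, as described under VRAM requirements.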

The model is not yet widely available on Ollama, but you can run it using Kyutai’s own delayed-streams-modeling repository, which provides Python scripts and stt-rs for Rust-based inference. Hugging Face transformers also supports the model natively from version 4.53.0 onward via the kyutai/stt-2.6b-en-trfs checkpoint, offering a familiar API for PyTorch users.

Expected performance

Exact tokens-per-second (TPS) metrics are not published, but you can derive a reasonable estimate. The model consumes 400 audio tokens per second of audio, grouped into 12.5 frames, and emits one text token per frame, so keeping up with live audio means completing a decoder step every 80 ms. Given that the model has 2.6B parameters and is a Transformer decoder, inference speed is primarily limited by memory bandwidth. As a rough, unbenchmarked estimate, an RTX 4090 at full fp16 should process audio at about 2–3× real-time (i.e., a 10-second clip transcribes in roughly 3–5 seconds wall-clock). With Q4_K_M on an RTX 4060, expect about 1–2× real-time, which is still usable for live captioning.
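The relationship between a speed factor and wall-clock time is simple division, but it is easy to get backwards. A small sketch, using the estimated 2–3× range from the paragraph above as inputs (these are the article's estimates, not measurements):

```python
def wall_clock_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Time to transcribe a clip at a given multiple of real-time."""
    if speed_factor <= 0:
        raise ValueError("speed_factor must be positive")
    return audio_seconds / speed_factor


def keeps_up_with_live_audio(speed_factor: float) -> bool:
    """A live stream needs at least 1x real-time to avoid falling behind."""
    return speed_factor >= 1.0


slow = wall_clock_seconds(10, 2.0)  # lower end of the estimated range
fast = wall_clock_seconds(10, 3.0)  # upper end of the estimated range
print(f"10 s clip: {fast:.1f}-{slow:.1f} s wall-clock")
print(keeps_up_with_live_audio(1.0), keeps_up_with_live_audio(0.8))
```

Anything at or above 1× is sufficient for a single live stream; the surplus only matters for batching or for catching up after a stall.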

For batch processing, the model shines: the H100 figure of 400 concurrent streams suggests that any modern GPU with enough VRAM can handle dozens of parallel streams simultaneously. This is a key advantage over offline models that would need to queue and segment audio.
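Batching does not change the per-step deadline: every stream delivers a new frame every 80 ms, so one batched forward pass has to finish within that window no matter how many streams it covers. The sketch below works out what the 400-stream H100 figure implies; `batch_budget` is a name introduced here for illustration.

```python
FRAME_RATE_HZ = 12.5  # frames per second of audio, from the article


def batch_budget(concurrent_streams: int) -> dict:
    """Throughput implied by serving N real-time streams in one batch."""
    if concurrent_streams < 1:
        raise ValueError("need at least one stream")
    return {
        # Deadline for one batched decoder step, independent of batch size.
        "step_budget_ms": 1000.0 / FRAME_RATE_HZ,
        # Frames the model processes per wall-clock second across the batch.
        "frames_per_second": concurrent_streams * FRAME_RATE_HZ,
        # Hours of audio transcribed per wall-clock hour.
        "audio_hours_per_hour": float(concurrent_streams),
    }


h100 = batch_budget(400)
for name, value in h100.items():
    print(f"{name}: {value:,.1f}")
```

How many streams a smaller GPU can sustain depends on how its batched step time scales, which this arithmetic cannot predict; it only sets the deadline each step has to meet.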

Hardware recommendations

GPU	VRAM	Precision/Quantization	Expected Use
RTX 4090, RTX 3090	24 GB	fp16	Single-stream with headroom
RTX 4080 (16 GB), RTX 4070 Ti (12 GB)	12–16 GB	Q8_0 or Q4_K_M	Good for single-stream; batching limited
RTX 3060 (12 GB), RTX 4060 (8 GB)	8–12 GB	Q4_K_M	Single-stream real-time; no batching
Apple M4 Max (64 GB)	64 GB	fp16	Single-stream with high throughput
Apple M2 (16 GB)	16 GB	Q4_K_M	Real-time transcription

For the quickest local setup, use the Hugging Face transformers integration with the kyutai/stt-2.6b-en-trfs checkpoint. If you need maximum performance and batching, the Rust inference (stt-rs) from the delayed-streams-modeling repo is faster and more memory-efficient.

How It Compares

vs OpenAI Whisper large-v3 (1.5B parameters)

Whisper large-v3 is a non-streaming encoder-decoder model. Its English WER is roughly 8–9% by the figures used here (slightly better with prompt tuning), against 6.4% for Kyutai STT 2.6B EN in streaming mode, a significant accuracy advantage if both are measured on the same test sets. However, Whisper transcribes a complete recording in one pass, so there is no fixed streaming delay to account for, whereas Kyutai STT always trails the audio by 2.5 seconds. Whisper also supports roughly 100 languages; Kyutai STT 2.6B EN is English-only.

Choose Kyutai STT 2.6B EN if you need streaming, lower latency than Whisper’s full-audio approach, and higher accuracy for English. Choose Whisper if you need multilingual support or work with offline audio where latency is irrelevant.

vs Kyutai STT 1B EN-FR (1B params, 0.5s delay)

Kyutai’s own 1B model trades accuracy for latency. It has a 0.5-second delay (5× shorter) but a higher WER (estimated at ~8–9% based on typical scaling). It also supports French. The 2.6B model is the better choice when English accuracy matters most; the 1B model is better for interactive voice agents that need near-instant feedback or for French-English use.

vs NVIDIA Parakeet-CTC (0.6B params)

Parakeet-CTC is a faster, streaming-compatible model trained with CTC loss. It is more memory-efficient (600M parameters) and can run on CPUs, but its WER is reportedly higher (double-digit in this comparison). Kyutai STT 2.6B EN is in a different class: it offers near-offline accuracy in a streaming form factor, making it suitable for production-grade transcription where quality cannot be compromised.

In summary, Kyutai STT 2.6B EN occupies a distinct niche: it is among the most accurate streaming English ASR models that fit on a single consumer GPU. For developers building voice applications on local hardware, it is a strong option, provided the 2.5-second latency is acceptable for the use case.
