Kyutai

Kyutai STT 2.6B EN

Kyutai's 2.6B-parameter, English-only streaming speech-to-text model, built on the multistream Moshi architecture. It reports a 6.4% WER on the Open ASR Leaderboard while operating in streaming mode with a 2.5 s delay.

2.6B params · Dense


Model Specifications

Parameters: 2.6B

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

CC-BY-4.0

Performance & Scoring

Benchmarks

WER: 6.4%
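For readers unfamiliar with the metric, word error rate is the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch, not the leaderboard's scoring code: the function name is illustrative, and the text normalization that leaderboards typically apply before scoring is omitted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 6.4% therefore corresponds to roughly 6 to 7 word-level errors per 100 reference words.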

Overall Score

51.5 (tier C)

Benchmark (40% weight): 87.2

Popularity (25% weight): 18.7

Efficiency (25% weight): 17.8

Versatility (10% weight): 75.0
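The page does not state how the overall score is derived, but the four component scores and their weights are consistent with a plain weighted sum. The sketch below checks that arithmetic, assuming that formula; the dictionary layout and function name are illustrative only.

```python
# Component name -> (weight, score), as listed on the page.
COMPONENTS = {
    "Benchmark": (0.40, 87.2),
    "Popularity": (0.25, 18.7),
    "Efficiency": (0.25, 17.8),
    "Versatility": (0.10, 75.0),
}


def overall_score(components: dict[str, tuple[float, float]]) -> float:
    """Weighted sum of component scores (assumed formula, not documented)."""
    return sum(weight * score for weight, score in components.values())


if __name__ == "__main__":
    # 34.88 + 4.675 + 4.45 + 7.5
    print(f"{overall_score(COMPONENTS):.1f}")
```

Under this assumption, the strong benchmark score is pulled down mainly by the low popularity and efficiency components.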

Hardware Compatibility

See which devices can run this model and at what quality level.


85 devices; the first 25 are listed below.


Device	Vendor	Tier	Memory
Acer Veriton GN100 AI Mini	Acer	S	2.1 GB
AMD Instinct MI300X	AMD	S	2.1 GB
AMD Instinct MI325X	AMD	S	2.1 GB
AMD Instinct MI355X	AMD	S	2.1 GB
AMD Radeon RX 7600 8GB	AMD	S	2.1 GB
AMD Radeon RX 7700 XT	AMD	S	2.1 GB
AMD Radeon RX 7800 XT	AMD	S	2.1 GB
AMD Radeon RX 7900 XT	AMD	S	2.1 GB
AMD Radeon RX 7900 XTX	AMD	S	2.1 GB
AMD Radeon RX 9070	AMD	S	2.1 GB
AMD Radeon RX 9070 XT	AMD	S	2.1 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	S	2.1 GB
Apple M4	Apple	S	2.1 GB
Apple M4 Max (40-core GPU)	Apple	S	2.1 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	S	2.1 GB
Apple M5	Apple	S	2.1 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	S	2.1 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	S	2.1 GB
Apple Mac Mini (M1, 2020)	Apple	S	2.1 GB
Apple Mac Mini (M2, 2023)	Apple	S	2.1 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	S	2.1 GB
Apple Mac Mini (M4, 2024)	Apple	S	2.1 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	S	2.1 GB
Apple Mac Studio (M1 Max, 2022)	Apple	S	2.1 GB
Apple Mac Studio (M1 Ultra, 2022)	Apple	S	2.1 GB



About This Model

Kyutai STT 2.6B EN

Kyutai STT is a decoder-only, streaming speech-to-text model from Kyutai Labs (Paris), leveraging the multistream architecture of Moshi (arXiv:2410.00037) to jointly model audio and text streams.

Architecture

Decoder-only Transformer (~2.6B params)
Audio is tokenized by Mimi (Moshi's neural audio codec) at 12.5 Hz, with 32 audio tokens per frame
Text stream is delayed by 2.5 s relative to the audio stream, so the model predicts each text token from audio it has already received, which makes streaming a natural fit
Sampling rate: 24 kHz (via Mimi)
Produces capitalized, punctuated transcripts with recoverable token timestamps
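The figures above imply some simple stream arithmetic, worked through in the sketch below. All constants come from the list; the derived quantities and the helper name `audio_tokens_for` are illustrative, and how the implementation rounds the 2.5 s delay to whole frames is not stated here.

```python
SAMPLE_RATE_HZ = 24_000   # Mimi input sampling rate
FRAME_RATE_HZ = 12.5      # Mimi frames per second
TOKENS_PER_FRAME = 32     # audio tokens per frame
TEXT_DELAY_S = 2.5        # shift of the text stream relative to audio

frame_duration_ms = 1000 / FRAME_RATE_HZ                       # 80.0 ms of audio per frame
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ             # 1920.0 samples per frame
audio_tokens_per_second = FRAME_RATE_HZ * TOKENS_PER_FRAME     # 400.0 tokens per second
text_delay_frames = TEXT_DELAY_S * FRAME_RATE_HZ               # 31.25 frames of look-ahead


def audio_tokens_for(duration_s: float) -> int:
    """Audio tokens consumed for a clip, counting only complete frames."""
    complete_frames = int(duration_s * FRAME_RATE_HZ)
    return complete_frames * TOKENS_PER_FRAME


if __name__ == "__main__":
    print(frame_duration_ms, samples_per_frame, audio_tokens_per_second)
    print(text_delay_frames)
    print(audio_tokens_for(2 * 60 * 60))  # a two-hour recording
```

Each frame covers 80 ms of audio, so the 2.5 s text delay amounts to roughly 31 frames of audio context before the corresponding text is emitted.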

Training

Pretraining: 2.5M hours of public audio with synthetic transcripts from whisper-timestamped
Finetuning: 24k hours of public datasets with ground-truth labels
Long-form finetuning: concatenated LibriSpeech (1,000 h) plus 22k hours of synthetic dialogs
Compute: 48× H100 for pretraining, 16× H100 for finetuning

What makes it distinctive

Best-in-class WER (6.4%) among the open streaming ASR models surveyed
Streams with only a 2.5 s delay, stays robust in noisy conditions, and handles audio up to 2 hours long
Native integration in Hugging Face Transformers (≥ 4.53) via the KyutaiSpeechToText* classes; MLX and Candle builds are also available
Based on the same technology stack as Kyutai's Moshi full-duplex voice model
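As a rough illustration of the Transformers integration mentioned above, the sketch below transcribes a single waveform offline. It rests on several assumptions: the checkpoint name `kyutai/stt-2.6b-en-trfs` is assumed to be the Transformers-format repository, the input is assumed to be a mono float array already resampled to 24 kHz, and the `transcribe` helper is hypothetical. It performs batch generation rather than true streaming, so check the Hugging Face model card for the authoritative usage.

```python
MODEL_ID = "kyutai/stt-2.6b-en-trfs"  # assumed Transformers-format checkpoint name
EXPECTED_SAMPLE_RATE = 24_000          # the model expects 24 kHz audio (via Mimi)


def transcribe(waveform, model_id: str = MODEL_ID) -> str:
    """Transcribe one mono 24 kHz waveform (hypothetical helper, offline mode)."""
    # Heavy dependencies are imported lazily so the module loads without them.
    import torch
    from transformers import (
        KyutaiSpeechToTextForConditionalGeneration,
        KyutaiSpeechToTextProcessor,
    )

    device = "cuda" if torch.cuda.is_available() else "cpu"
    processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
    model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(
        model_id, device_map=device, torch_dtype="auto"
    )

    # The processor turns the raw waveform into model inputs.
    inputs = processor(waveform).to(device)
    output_tokens = model.generate(**inputs)

    # Decode the generated text tokens back into a transcript string.
    return processor.batch_decode(output_tokens, skip_special_tokens=True)[0]
```

Calling `transcribe(audio_array)` downloads the weights on first use, so a GPU and a few gigabytes of free disk space are advisable.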

Use cases

Live captioning, real-time voice assistants, meeting/call transcription, accessibility, and powering downstream LLM voice pipelines (e.g. Unmute).
