Kyutai's 2.6B-parameter, English-only streaming speech-to-text model, built on the multistream Moshi architecture. It delivers a state-of-the-art 6.4% word error rate (WER) on the OpenASR Leaderboard while operating in streaming mode with a 2.5 s delay.
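For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration, not the leaderboard's scoring code: the function name `word_error_rate` is hypothetical, and the normalization here (lowercasing and whitespace splitting) is an assumed simplification of whatever text normalization the leaderboard applies.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed, simplified normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a 6.4% WER corresponds to roughly 6 to 7 word errors per 100 reference words.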
Kyutai STT is a decoder-only, streaming speech-to-text model from Kyutai Labs (Paris), leveraging the multistream architecture of Moshi (arXiv:2410.00037) to jointly model audio and text streams.
whisper-timestamped
KyutaiSpeechToText*; also MLX and Candle builds
Live captioning, real-time voice assistants, meeting/call transcription, accessibility, and powering downstream LLM voice pipelines (e.g. Unmute).