Kyutai's 2.6B-parameter English-only streaming speech-to-text model, built on the multistream Moshi architecture. Delivers a state-of-the-art 6.4% word error rate (WER) on the OpenASR Leaderboard while operating in streaming mode with a 2.5 s delay.
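For context on the 6.4% figure: WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the leaderboard's scoring code. The function name `word_error_rate` is hypothetical, and the sketch splits on whitespace only, whereas leaderboards typically normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Under this definition, a 6.4% WER corresponds to roughly six or seven word-level errors per 100 reference words, averaged over the evaluation sets.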
A workable 2.6B-parameter dense audio model from Kyutai. Treat the modality benchmarks above as the leading indicator of fit; composite scoring across modalities is still maturing.
Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.
whisper-timestamped
KyutaiSpeechToText*; also MLX and Candle builds
Live captioning, real-time voice assistants, meeting/call transcription, accessibility, and powering downstream LLM voice pipelines (e.g. Unmute).