
Kyutai's 2.6B-parameter English-only streaming speech-to-text model, built on the multistream Moshi architecture. Delivers state-of-the-art 6.4% WER on OpenASR Leaderboard while operating in streaming mode with a 2.5 s delay.
A workable 2.6B-parameter dense audio model from Kyutai. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 5070 Ti | Vast.ai | Spot | 16 GB | $0.11 |
| NVIDIA GeForce RTX 3070 | RunPod | Community | 8 GB | $0.13 |
| NVIDIA GeForce RTX 3070 | RunPod | Spot | 8 GB | $0.13 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.13 |
| NVIDIA GeForce RTX 4090 | Vast.ai | Spot | 24 GB | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Kyutai STT 2.6B EN is a streaming speech-to-text model built by the Paris-based open-science lab Kyutai. At 2.6 billion parameters, it is one of the most accurate streaming ASR models available today, achieving 6.4% word error rate (WER) on the OpenASR Leaderboard while operating with only a 2.5-second delay. Unlike traditional offline ASR models that require the entire audio clip before transcribing, Kyutai STT outputs text incrementally as audio streams in — a design that makes it practical for voice agents, live captioning, and any application where low latency matters more than waiting for the full recording.
The model is English-only and uses a dense Transformer decoder architecture derived from the multistream framework of Kyutai’s Moshi. It processes audio tokenized by the Mimi neural codec at 12.5 Hz, with each frame represented by 32 audio tokens. The text stream is shifted by 2.5 seconds relative to the audio stream, so the model has already heard 2.5 seconds of audio beyond a word before it commits to that word. This fixed delay is a trade-off: you get state-of-the-art accuracy for a streaming model, but every word appears roughly 2.5 seconds after it is spoken.
Licensed under CC-BY-4.0, the weights are freely available for commercial and research use with attribution. The model competes with Whisper (which is inherently offline, though it can be adapted to streaming with segmentation) and with real-time models such as NVIDIA Parakeet or Google’s streaming models. Kyutai STT 2.6B EN distinguishes itself through its native streaming architecture, batching efficiency, and the ability to run on consumer GPUs.
Kyutai STT 2.6B EN is a dense, decoder-only Transformer — not a mixture-of-experts (MoE). This means all 2.6 billion parameters are active during every forward pass. For inference, this translates to predictable VRAM usage and consistent runtime, but it also means the model is heavier on memory than an MoE model with the same total parameter count where only a fraction of parameters are used per token.
The audio frontend uses Kyutai’s Mimi codec to convert raw audio into discrete tokens at 12.5 frames per second. Each frame is represented by 32 audio tokens, yielding a total of 400 audio tokens per second. These tokens are fed into the Transformer, which predicts a stream of text tokens. The text stream is offset by 2.5 seconds (about 31 frames) from the audio stream. This delay is the key to the model’s streaming behavior: it can only begin outputting text once the first 2.5 seconds of audio have been consumed.
Context length is not officially specified, but the model has been demonstrated to handle audio segments up to two hours in length without degradation. This suggests the positional encoding (likely relative or RoPE) supports long sequences. Given the high token rate (400 audio tokens/second), two hours of audio would require approximately 2.88 million audio tokens — an impressive practical context if confirmed.
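The token-rate arithmetic in the two paragraphs above can be checked with a short sketch. The constants are taken from the text; the function names are illustrative only and are not part of any Kyutai API.

```python
# Figures quoted in the text.
FRAME_RATE_HZ = 12.5     # Mimi frames per second
TOKENS_PER_FRAME = 32    # audio tokens per frame
DELAY_SECONDS = 2.5      # offset between the audio and text streams


def frames(seconds: float) -> float:
    """Number of codec frames covering `seconds` of audio."""
    return seconds * FRAME_RATE_HZ


def audio_tokens(seconds: float) -> int:
    """Number of audio tokens covering `seconds` of audio."""
    return round(frames(seconds) * TOKENS_PER_FRAME)


tokens_per_second = audio_tokens(1)         # 400
delay_frames = frames(DELAY_SECONDS)        # 31.25, i.e. about 31 frames
two_hour_frames = frames(2 * 3600)          # 90000.0
two_hour_tokens = audio_tokens(2 * 3600)    # 2880000

print(tokens_per_second, delay_frames, two_hour_frames, two_hour_tokens)
```

Note that two hours of audio is 2.88 million audio tokens but only 90,000 frames, since each frame carries 32 tokens.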
The model outputs text with proper capitalization and punctuation. Word-level timestamps are recovered by converting the frame index of each predicted token to seconds (frame index divided by 12.5) and subtracting the 2.5-second offset. This is a straightforward post-processing step that Kyutai’s inference code handles.
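As a minimal sketch of that post-processing step, assuming each word comes paired with the frame index at which its text token was emitted: the `(frame_index, word)` input format and the function names here are hypothetical, and Kyutai’s actual inference code may represent this differently.

```python
FRAME_RATE_HZ = 12.5   # frames per second, from the text
DELAY_SECONDS = 2.5    # text-stream offset, from the text


def word_time(frame_index: int) -> float:
    """Approximate time (in seconds) at which a word was spoken.

    Converts the frame index to seconds, then removes the fixed delay.
    Clamped at zero so early frames never yield negative times.
    """
    return max(0.0, frame_index / FRAME_RATE_HZ - DELAY_SECONDS)


def align(words):
    """Map hypothetical (frame_index, word) pairs to (seconds, word) pairs."""
    return [(word_time(idx), word) for idx, word in words]


# Illustrative input, not real model output.
aligned = align([(50, "Hello"), (100, "world")])
print(aligned)  # [(1.5, 'Hello'), (5.5, 'world')]
```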
The architecture is fully open-source, with training details and checkpoints available on Hugging Face. The model was pretrained on 2.5 million hours of public audio with synthetic transcripts from Whisper, then fine-tuned on smaller, high-quality datasets with ground-truth transcripts.
Kyutai STT 2.6B EN is designed for one thing: streaming English speech recognition. It does not perform speaker diarization, language identification, or emotion detection. What it does, it does well.
Real-world use cases include:
This is where Kyutai STT 2.6B EN becomes interesting for practitioners: it can run on consumer hardware with moderate VRAM, and its streaming architecture means you don’t need a cloud API to get low-latency transcriptions.
At full fp16 precision, the model consumes roughly 5.2 GB of VRAM for weights, plus an additional 1–2 GB for runtime buffers (audio codec, attention cache, temporary tensors). The peak instantaneous memory during a streaming forward pass is about 8–10 GB. This means an NVIDIA RTX 3090 or RTX 4090 (24 GB VRAM) runs it comfortably at full precision. An M4 Max with 48 GB unified memory or an M3 Ultra will also run it without issues.
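The weight figure follows directly from parameter count times bytes per parameter. The sketch below reproduces it and shows idealized lower-precision footprints for comparison. The 1-byte and 0.5-byte values are assumptions for illustration; real quantized formats add overhead for scales and metadata, and runtime buffers come on top.

```python
N_PARAMS = 2.6e9  # parameter count from the text


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


fp16_gb = weight_memory_gb(N_PARAMS, 2.0)    # 16-bit weights
int8_gb = weight_memory_gb(N_PARAMS, 1.0)    # idealized 8-bit weights
int4_gb = weight_memory_gb(N_PARAMS, 0.5)    # idealized 4-bit weights

print(f"fp16: {fp16_gb:.1f} GB, 8-bit: {int8_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```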
For users with less VRAM, quantization is effective:
The model is not yet widely available on Ollama, but you can run it using Kyutai’s own delayed-streams-modeling repository, which provides Python scripts and stt-rs for Rust-based inference. The Hugging Face transformers integration from version 4.53.0 also supports the model natively via kyutai/stt-2.6b-en-trfs, offering a familiar API for PyTorch users.
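A minimal sketch of the transformers route, assuming transformers 4.53.0 or later and the `KyutaiSpeechToTextProcessor` / `KyutaiSpeechToTextForConditionalGeneration` classes it ships. The 24 kHz sample rate is an assumption based on the Mimi codec, and the `transcribe` helper is hypothetical. This transcribes a complete waveform in one call rather than streaming it, so verify the details against the model card before relying on it.

```python
MODEL_ID = "kyutai/stt-2.6b-en-trfs"   # checkpoint named in the text
SAMPLE_RATE_HZ = 24_000                # assumption: Mimi's expected input rate


def transcribe(waveform, device: str = "cpu") -> str:
    """Transcribe a mono float waveform sampled at SAMPLE_RATE_HZ.

    Heavy dependencies are imported lazily so this module loads without them.
    """
    from transformers import (
        KyutaiSpeechToTextForConditionalGeneration,
        KyutaiSpeechToTextProcessor,
    )

    processor = KyutaiSpeechToTextProcessor.from_pretrained(MODEL_ID)
    model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(
        MODEL_ID, device_map=device, torch_dtype="auto"
    )

    # Tokenize the audio, run generation, and decode the text tokens.
    inputs = processor(waveform)
    inputs.to(device)
    output_tokens = model.generate(**inputs)
    return processor.batch_decode(output_tokens, skip_special_tokens=True)[0]
```

Calling `transcribe(audio_array, device="cuda")` downloads the checkpoint on first use. Loading the model once and reusing it across calls would be the obvious next step for anything beyond a quick test.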
Exact throughput figures are not published, but a rough estimate is possible. The model consumes 400 audio tokens per second of audio (12.5 frames of 32 tokens each) and emits one text-stream token per frame, so real-time operation requires 12.5 decoding steps per second. With 2.6B parameters in a Transformer decoder, inference speed is primarily limited by memory bandwidth. On an RTX 4090 at fp16, expect roughly 2–3× real-time (a 10-second audio clip transcribes in about 3–5 seconds of wall-clock time). With Q4_K_M on an RTX 4060, expect 1–2× real-time, which is still usable for live captioning.
Batch processing is where the model shines. Kyutai reports 400 concurrent real-time streams on a single H100, which suggests that a modern consumer GPU with enough VRAM can handle dozens of parallel streams. This is a key advantage over offline models, which would need to queue and segment audio.
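The real-time arithmetic can be made concrete with a small sketch. The speed factors are the estimates quoted above, not measurements, and the helper names are illustrative.

```python
FRAME_RATE_HZ = 12.5  # frames per second, from the text


def wall_clock_seconds(audio_seconds: float, speed_factor: float) -> float:
    """Time needed to transcribe a clip at `speed_factor` times real-time."""
    return audio_seconds / speed_factor


def step_budget_ms(speed_factor: float = 1.0) -> float:
    """Maximum time per decoding step (one frame) to sustain `speed_factor`."""
    return 1000.0 / (FRAME_RATE_HZ * speed_factor)


slow = wall_clock_seconds(10, 2)   # 5.0 s at 2x real-time
fast = wall_clock_seconds(10, 3)   # about 3.33 s at 3x real-time
budget = step_budget_ms()          # 80.0 ms per frame at exactly real-time

print(f"{slow:.2f} s, {fast:.2f} s, {budget:.0f} ms per step")
```

A single stream keeps up with live audio as long as each decoding step finishes within the 80 ms frame budget.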
| GPU | VRAM | Precision/Quantization | Expected Use |
|---|---|---|---|
| RTX 4090, RTX 3090 | 24 GB | fp16 | Single-stream with headroom |
| RTX 4070 Ti Super, RTX 4080 | 16 GB | Q8_0 or Q4_K_M | Good for single-stream; batching limited |
| RTX 3060 (12 GB), RTX 4060 (8 GB) | 8–12 GB | Q4_K_M | Single-stream real-time; no batching |
| Apple M4 Max (64 GB) | 64 GB | fp16 | Single-stream with high throughput |
| Apple M2 (16 GB) | 16 GB | Q4_K_M | Real-time transcription |
For the quickest local setup, use the Hugging Face transformers pipeline with the stt-2.6b-en-trfs checkpoint. If you need maximum performance and batching, the Rust inference (stt-rs) from the delayed-streams-modeling repo is faster and more memory-efficient.
Whisper large-v3 is a non-streaming encoder-decoder model. It achieves ~8–9% WER on English (slightly better with prompt tuning). Kyutai STT 2.6B EN achieves 6.4% WER while streaming, a significant accuracy advantage. However, Whisper works on complete recordings, so a streaming delay does not apply to it, whereas Kyutai STT has a fixed 2.5-second latency. Whisper also supports roughly 100 languages; Kyutai STT 2.6B EN is English-only.
Choose Kyutai STT 2.6B EN if you need streaming, lower latency than Whisper’s full-audio approach, and higher accuracy for English. Choose Whisper if you need multilingual support or work with offline audio where latency is irrelevant.
Kyutai’s own 1B model trades accuracy for latency. It has a 0.5-second delay (5× shorter than the 2.6B model’s 2.5 seconds) but a higher WER (likely ~8–9% based on typical scaling), and it also supports French. The 2.6B model is the better choice when English accuracy is the priority; the 1B model suits interactive voice agents that need near-instant feedback, or applications that need French as well as English.
Parakeet-CTC is a faster, streaming-compatible model using CTC loss. It is more memory efficient (600M params) and can run on CPUs, but its WER is higher (double-digit). Kyutai STT 2.6B EN is in a different class: it offers near-offline accuracy in a streaming form factor, making it suitable for production-grade transcription where quality cannot be compromised.
In summary, Kyutai STT 2.6B EN occupies a unique niche: it is the most accurate streaming English ASR model that fits on a single consumer GPU. For developers building voice applications on local hardware, it is the current best option — provided the 2.5-second latency is acceptable for the use case.