
A 245M-parameter streaming English ASR model from Useful Sensors designed for low-latency, on-device transcription. Uses an 'ergodic' sliding-window encoder and achieves better WER than Whisper Large v3 on the OpenASR Leaderboard at a fraction of the size.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · On-Demand · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
Moonshine Streaming Medium is a 245M-parameter streaming English automatic speech recognition (ASR) model from Useful Sensors, designed to bring low-latency, on-device transcription to edge hardware. The model achieves a word error rate of 6.65% on the OpenASR Leaderboard — outperforming Whisper Large v3 despite being roughly 6× smaller. It’s purpose-built for real-time voice applications where latency, privacy, and local execution matter more than cloud connectivity.
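For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the leaderboard's scoring code: it only lowercases and splits on whitespace, whereas real evaluations typically apply heavier text normalization. The function name and example sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "turn on the kitchen lights"
    hypothesis = "turn on kitchen light"
    # One deleted word plus one substituted word over five reference words.
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

A WER of 6.65% therefore corresponds to roughly one word-level error per 15 reference words, averaged over the evaluation sets.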
The model is part of the broader Moonshine Voice framework, an open-source toolkit (MIT license) from a team that includes original members of the TensorFlow team. Moonshine Streaming Medium targets developers building voice agents, live captioning, smart assistants, and interactive systems that must respond while the user is still speaking. It’s a dense architecture with 0.245B parameters, optimized for streaming inference rather than batch processing.
Moonshine Streaming Medium uses a sequence-to-sequence Transformer with a novel “ergodic” sliding-window encoder. The encoder processes audio in overlapping chunks using bounded local attention and no positional embeddings; positional information is injected via an adapter before the autoregressive decoder. Because each encoder frame depends only on a bounded neighborhood of audio, transcription can start before the full clip has arrived, enabling sub-100 ms streaming latency on modest hardware.
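Bounded local attention can be pictured as a band-shaped mask: each frame may attend only to a fixed number of neighbors, so the cost per frame does not grow with the length of the audio. The following is a minimal sketch of that idea in plain Python. The `left` and `right` window sizes are hypothetical values chosen for illustration and are not the model's actual configuration.

```python
def local_attention_mask(num_frames: int, left: int, right: int) -> list[list[bool]]:
    """Return mask[i][j], True when frame i may attend to frame j."""
    return [
        [i - left <= j <= i + right for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Hypothetical window: 3 frames of history and 1 frame of lookahead.
    mask = local_attention_mask(num_frames=8, left=3, right=1)
    for row in mask:
        print("".join("x" if allowed else "." for allowed in row))

    # No row allows more than left + right + 1 positions, however long the audio is.
    print("max positions attended:", max(sum(row) for row in mask))
```

Because the mask depends only on the offset between frames, the same pattern applies at any point in the stream, which fits an encoder that does not use absolute positional embeddings.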
Key architectural characteristics:
automatic-speech-recognition), plus native C API and pre-built packages for iOS, Android, Python, macOS, Windows, Linux, Raspberry Pi, and wearables.
The context window is effectively unbounded for streaming: the model processes audio indefinitely by shifting the attention window forward. No context length is specified, but practical usage suggests it handles continuous streams of at least several minutes without degradation.
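The reason a shifting window can handle an endless stream is that memory is bounded by the window size rather than by the stream length. The sketch below illustrates this with a generic sample-level sliding window. It is not the model's internal implementation, and the `window` and `hop` parameters are hypothetical.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_windows(samples: Iterable[float], window: int, hop: int) -> Iterator[list[float]]:
    """Yield overlapping windows from a possibly endless sample stream."""
    if not 0 < hop <= window:
        raise ValueError("require 0 < hop <= window")

    buffer = deque(maxlen=window)  # old samples fall off; memory stays bounded
    since_last = 0                 # samples received since the last emitted window
    for sample in samples:
        buffer.append(sample)
        since_last += 1
        if len(buffer) == window and since_last >= hop:
            yield list(buffer)
            since_last = 0


if __name__ == "__main__":
    # Eight dummy samples, windows of 4 with a hop of 2 (50% overlap).
    for chunk in sliding_windows(range(8), window=4, hop=2):
        print(chunk)
```

In a real pipeline the iterable would be fed by a microphone callback, and each emitted window would be passed to the encoder.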
Moonshine Streaming Medium is built for one thing: converting English speech to text in real time on local hardware. It excels at:
The model is not suited for multilingual transcription, music/sound event detection, or offline batch processing of very long recordings (non-streaming models like Whisper may be more efficient for that).
The small parameter count makes Moonshine Streaming Medium very accessible. At 0.245B parameters and 4 bytes per weight, the FP32 weights occupy roughly 1 GB of VRAM or system memory. Quantization drops this further:
| Quantization | Approximate VRAM (weights only) | Typical Use Case |
|---|---|---|
| FP32 (unquantized) | ~1.0 GB | Maximum accuracy on desktop/laptop |
| FP16 | ~0.5 GB | Good quality, runs on most integrated GPUs |
| Q8_0 (8-bit) | ~0.3 GB | Balanced for older hardware |
| Q4_K_M (4-bit) | ~0.15 GB | Fits on smartphones, Raspberry Pi 4+ |
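The weight-only footprint at each precision follows directly from the parameter count: parameters times bits per weight, divided by 8. The sketch below works through that arithmetic. The effective bits per weight for Q8_0 and Q4_K_M are assumptions based on the usual GGUF block layouts, not figures published for this model, and runtime buffers and activations add to these totals.

```python
PARAMS = 245_000_000  # parameter count stated for the model

# Effective bits per weight. The two quantized values are assumptions based on
# common GGUF block layouts, not measurements of this model.
BITS_PER_WEIGHT = {
    "FP32": 32.0,
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def weight_memory_gb(params: int, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name:>7}: ~{weight_memory_gb(PARAMS, bits):.2f} GB")
```

Lower-precision formats always need less weight memory than FP16, so an 8-bit figure above the FP16 figure would indicate that something other than weights is being counted.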
Recommended hardware:
Performance expectations: On an RTX 3090, the model achieves roughly 50–80 tokens per second for the decoder, while the encoder runs at ~10–20 ms per 30 ms audio window (a real-time factor of about 0.33–0.67). Overall end-to-end latency from audio input to first word is typically below 100 ms on a mid-range GPU.
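Real-time factor is processing time divided by the duration of the audio processed; any value below 1 means the encoder keeps up with live input. The short sketch below applies that definition to the encoder timings quoted above. The function name is hypothetical, and the timings are the page's figures rather than new measurements.

```python
def real_time_factor(processing_ms: float, audio_ms: float) -> float:
    """Processing time divided by audio duration; below 1.0 keeps up with live audio."""
    if audio_ms <= 0:
        raise ValueError("audio_ms must be positive")
    return processing_ms / audio_ms


if __name__ == "__main__":
    window_ms = 30.0
    for encoder_ms in (10.0, 20.0):
        rtf = real_time_factor(encoder_ms, window_ms)
        headroom_ms = window_ms - encoder_ms  # time left for decoding and I/O
        print(f"{encoder_ms:.0f} ms per {window_ms:.0f} ms window: "
              f"RTF {rtf:.2f}, headroom {headroom_ms:.0f} ms")
```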
Quickest path to run locally: Install the Moonshine Python package via pip install moonshine (see [GitHub](https://github.com/usefulsensors/moonshine)). Alternatively, use Hugging Face Transformers with MoonshineStreamingForConditionalGeneration as shown in the model card. No API keys or cloud accounts are needed.
vs. Whisper Large v3 – Moonshine Medium beats Whisper Large v3 on the OpenASR Leaderboard (6.65% WER vs. ~7.4%) while being roughly 6× smaller (245M vs. about 1.5B parameters). Whisper is more robust to noisy environments and supports 99 languages; Moonshine is strictly English. For streaming, Moonshine’s sliding-window encoder provides lower latency and lower memory overhead. If you need multilingual speech recognition, Whisper remains the better choice.
vs. other edge ASR models (e.g., Paraformer-Large, Wav2Vec2-Large) – Moonshine Medium’s 245M parameters place it between small and medium edge ASR models. Paraformer-Large (~220M) offers competitive WER but is not designed for streaming. Moonshine’s explicit streaming architecture gives it a latency advantage. Wav2Vec2-Large (300M) requires fine-tuning for transcription (CTC-based) and lacks native streaming support; Moonshine is ready out of the box.
When to choose Moonshine Streaming Medium: You need real-time English transcription on a device with limited compute (Raspberry Pi, phone, low-end PC) and you want accuracy comparable to top cloud ASR APIs without sending data off-device. The MIT license and cross-platform support make it a low-friction option for production voice pipelines.
