Alibaba Qwen's compact 0.6B-parameter all-in-one multilingual ASR model supporting 52 languages and dialects, built on the Qwen3-Omni audio foundation model. Optimized for ultra-low latency (~92 ms time-to-first-token) and on-device deployment.
Access model weights, configuration files, and documentation.
See which devices can run this model and at what quality level.
Qwen3-ASR-0.6B is a lightweight automatic speech recognition model from Alibaba's Qwen team, released alongside the larger 1.7B variant. It is post-trained from the Qwen3-Omni audio-language foundation model and adopts a Large Audio-Language Model (LALM) paradigm: an audio encoder (AuT) produces acoustic features, a projector maps them into text-embedding space, and a Qwen3-based transformer decoder autoregressively emits transcriptions.
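As a rough illustration of this encoder → projector → decoder wiring, here is a toy-scale PyTorch sketch. The module choices, dimensions, vocabulary, and greedy loop are expository assumptions, not the released Qwen3-ASR implementation.

```python
import torch
import torch.nn as nn

# Toy-scale sketch of the LALM pipeline: audio encoder -> projector ->
# autoregressive text decoder. All dimensions and module designs below
# are illustrative assumptions, not the actual Qwen3-ASR architecture.
AUDIO_DIM, TEXT_DIM, VOCAB = 80, 256, 1000

audio_encoder = nn.GRU(AUDIO_DIM, 512, batch_first=True)   # stand-in for AuT
projector = nn.Linear(512, TEXT_DIM)                        # audio -> text-embedding space
embed = nn.Embedding(VOCAB, TEXT_DIM)                       # decoder token embeddings
layer = nn.TransformerEncoderLayer(TEXT_DIM, nhead=4, batch_first=True)
decoder = nn.TransformerEncoder(layer, num_layers=2)        # causal mask omitted for brevity
lm_head = nn.Linear(TEXT_DIM, VOCAB)

mel = torch.randn(1, 300, AUDIO_DIM)   # ~3 s of 16 kHz audio at a 10 ms hop (toy values)
acoustic, _ = audio_encoder(mel)       # acoustic features
prefix = projector(acoustic)           # projected into text-embedding space

tokens = [0]                           # BOS
for _ in range(5):                     # greedy autoregressive decoding
    text = embed(torch.tensor([tokens]))
    hidden = decoder(torch.cat([prefix, text], dim=1))
    next_id = lm_head(hidden[:, -1]).argmax(-1).item()
    tokens.append(next_id)
print("decoded token ids:", tokens[1:])  # untrained weights, so output is arbitrary
```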
Based on Qwen3-Omni, post-trained on large-scale speech-text pairs (exact dataset mix undisclosed). Audio input is resampled to 16 kHz and converted to mel-spectrogram features.
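A minimal front-end sketch of that preprocessing with torchaudio, assuming a Whisper-style log-mel setup (25 ms window, 10 ms hop, 128 mel bins); the model's actual feature extractor may use different parameters.

```python
import torch
import torchaudio

# Resample to 16 kHz and compute log-mel features. Window, hop, and
# n_mels are assumptions; check the model's feature-extractor config.
waveform, sr = torchaudio.load("speech.wav")       # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)      # downmix to mono
if sr != 16_000:
    waveform = torchaudio.functional.resample(waveform, sr, 16_000)

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16_000, n_fft=400, hop_length=160, n_mels=128,
)(waveform)                                        # (1, n_mels, frames)
log_mel = torch.log(mel + 1e-6)                    # log compression
print(log_mel.shape)
```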
Real-time dictation, edge/on-device transcription, subtitling, voice agents, call-center analytics, and pipelines needing a strong accuracy/efficiency trade-off.
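For scenarios like those above, if the checkpoint ships with Hugging Face transformers support, transcription could look like the sketch below. The model ID and pipeline compatibility are assumptions; defer to the official model card for the actual loading recipe.

```python
from transformers import pipeline

# Assumption: the weights are published under an ID like
# "Qwen/Qwen3-ASR-0.6B" and work with the standard ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="Qwen/Qwen3-ASR-0.6B",  # hypothetical model ID
)
result = asr("meeting.wav")  # pipeline resamples input to the model's rate
print(result["text"])
```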