Alibaba

Qwen3-ASR-0.6B

Alibaba Qwen's compact 0.6B-parameter all-in-one multilingual ASR model supporting 52 languages and dialects, built on the Qwen3-Omni audio foundation model. Optimized for ultra-low latency (~92ms TTFT) and on-device deployment.

0.6B params · Dense


Our Take

Best for: Open-source ASR workloads

A strong 0.6B-parameter dense audio model from Alibaba. Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 0.6B
Architecture: Dense
Provider: Alibaba
Download Size: 1.9 GB

Community

Monthly Downloads: 910.3K
Likes: 307
Last Updated: 4 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

WER: 6.4%
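
For context, word error rate is conventionally the word-level edit distance between the transcript and a reference, divided by the number of reference words. The sketch below assumes plain whitespace tokenization and no text normalization (benchmark pipelines usually normalize case and punctuation first). `word_error_rate` is an illustrative helper, not part of any Qwen toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 6.4% therefore corresponds to roughly six or seven word-level errors per hundred reference words on the evaluated data.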

MBA Open Score: 78.7 (AA)

Benchmark (40%): 87.2
Popularity (25%): 77.8
Efficiency (25%): 69.6
Versatility (10%): 70.0
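
The four component scores and their weights reproduce the headline score if the composite is a plain weighted average. That formula is an assumption, since the page does not state how the score is combined; the numbers are the ones listed here.

```python
# Component scores and weights as listed on this page.
components = {
    "Benchmark": (87.2, 0.40),
    "Popularity": (77.8, 0.25),
    "Efficiency": (69.6, 0.25),
    "Versatility": (70.0, 0.10),
}

# Assumption: the composite is a simple weighted average of the components.
total_weight = sum(weight for _, weight in components.values())
composite = sum(score * weight for score, weight in components.values())

print(f"weights sum to {total_weight:.2f}, composite = {composite:.1f}")
```

Under that assumption the result rounds to the published 78.7, which suggests no extra adjustment is applied on top of the weights.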

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	0.9 GB
Acer Veriton GN100 AI Mini	Acer	SS	0.9 GB
AMD Instinct MI300X	AMD	SS	0.9 GB
AMD Instinct MI325X	AMD	SS	0.9 GB
AMD Instinct MI355X	AMD	SS	0.9 GB
AMD Radeon RX 7600 8GB	AMD	SS	0.9 GB
AMD Radeon RX 7700 XT	AMD	SS	0.9 GB
AMD Radeon RX 7800 XT	AMD	SS	0.9 GB
AMD Radeon RX 7900 XT	AMD	SS	0.9 GB
AMD Radeon RX 7900 XTX	AMD	SS	0.9 GB
AMD Radeon RX 9070	AMD	SS	0.9 GB
AMD Radeon RX 9070 XT	AMD	SS	0.9 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	0.9 GB
Apple M4	Apple	SS	0.9 GB
Apple M4 Max (40-core GPU)	Apple	SS	0.9 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	0.9 GB
Apple M5	Apple	SS	0.9 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	0.9 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	0.9 GB
Apple Mac Mini (M1, 2020)	Apple	SS	0.9 GB
Apple Mac Mini (M2, 2023)	Apple	SS	0.9 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	0.9 GB
Apple Mac Mini (M4, 2024)	Apple	SS	0.9 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	0.9 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	0.9 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

Option	Cost / GPU-hour
NVIDIA L4 (Vast.ai · Spot · 24 GB VRAM)	$0.03
NVIDIA L4 (Vast.ai · On-Demand · 24 GB VRAM)	$0.04
NVIDIA GeForce RTX 5060 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.09
NVIDIA GeForce RTX 5060 Ti (Vast.ai · On-Demand · 16 GB VRAM)	$0.10
NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM)	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
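
To turn a GPU-hour price into a rough transcription cost, divide it by how many times faster than real time the model runs on that GPU. The sketch below is illustrative only: the 20x speed-up is a placeholder, not a measured figure for the L4 or any other card listed here, and `cost_per_audio_hour` is a hypothetical helper.

```python
def cost_per_audio_hour(price_per_gpu_hour: float, speedup: float) -> float:
    """Cost in dollars to transcribe one hour of audio.

    speedup is how many seconds of audio are processed per wall-clock
    second on the rented GPU (must be measured for your own setup).
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return price_per_gpu_hour / speedup


if __name__ == "__main__":
    # $0.03 per GPU-hour is the cheapest spot price in the table above;
    # 20x real time is an assumed placeholder speed-up.
    print(f"${cost_per_audio_hour(0.03, 20):.4f} per hour of audio")
```

Spot interruptions add restart overhead that this estimate ignores, so treat the result as a lower bound.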


About This Model

Overview

Alibaba’s Qwen3-ASR-0.6B is a compact, all-in-one automatic speech recognition model that packs support for 52 languages and dialects into 0.6 billion parameters. It is the smaller sibling in the Qwen3-ASR family, built on the audio understanding foundation of Qwen3-Omni. Unlike ASR pipelines that need a separate language-detection or post-processing stage, it handles language identification and transcription in one model, and the same architecture serves both streaming and offline inference.

The 0.6B version is purpose-built for on-device and edge deployment where latency matters more than raw accuracy. It achieves an average time-to-first-token (TTFT) of 92ms and can transcribe 2,000 seconds of audio in one second of wall-clock time at a concurrency of 128 on server-class hardware. For practitioners who need to run ASR locally without cloud dependencies, it offers a strong accuracy-efficiency trade-off for its size. Licensed under Apache 2.0, it is free for commercial use.

Architecture & Technical Details

Qwen3-ASR-0.6B uses a dense architecture with no mixture-of-experts routing, so all 0.6B parameters are active for every inference step. The trade-off is straightforward: a dense model spends more compute per stored parameter than a sparse MoE model, but it has no router and no inactive expert weights occupying memory. You get predictable VRAM consumption, predictable latency, and consistent throughput.

The model processes audio through a speech encoder (part of the Qwen3-Omni pipeline) and outputs text. It does not require a separate language classifier — language identification is integrated. It supports both chunked streaming and full-utterance offline modes from the same weights. The exact context length is not specified, but the model is designed to handle long audio via chunked processing; the companion forced-alignment model supports up to 5-minute segments across 11 languages.
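
Chunked processing of long audio can be pictured as slicing the recording into fixed-length, optionally overlapping windows that are transcribed one after another. The sketch below only computes the window boundaries. The 30-second chunk length and 5-second overlap are hypothetical values chosen for illustration, since the source does not state the model's actual chunk size, and `chunk_spans` is not a function from the Qwen toolkit.

```python
from typing import List, Tuple


def chunk_spans(
    total_seconds: float,
    chunk_seconds: float,
    overlap_seconds: float = 0.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds that cover the whole recording."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap must be non-negative and shorter than a chunk")

    spans = []
    step = chunk_seconds - overlap_seconds  # how far each window advances
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last window reached the end of the audio
        start += step
    return spans


if __name__ == "__main__":
    # A 70-second recording with hypothetical 30 s chunks and 5 s overlap.
    for start, end in chunk_spans(70, 30, 5):
        print(f"{start:.0f}s to {end:.0f}s")
```

Overlap gives each window some context from its neighbour, at the cost of transcribing the overlapping audio twice and having to merge the duplicated text.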

Key architectural characteristics:

Dense – 0.6B parameters all active
Modality – Audio input, text output (ASR + language ID)
Inference modes – Streaming (chunk-by-chunk) and offline (full audio)
Latency – ~92ms TTFT average (streaming, on server-class hardware)
Throughput – ~2000x real-time at concurrency 128 (batch inference on GPU)
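
The throughput figure is an aggregate across all concurrent streams. As a rough sketch, assuming the load is spread evenly (the source does not report per-stream numbers), the average speed-up seen by each stream works out as follows:

```python
# Figures quoted above: about 2000 seconds of audio per wall-clock second
# at a concurrency of 128.
aggregate_speedup = 2000.0
concurrency = 128

# Assumption: every stream gets an equal share of the aggregate throughput.
per_stream_speedup = aggregate_speedup / concurrency

print(f"Average per-stream speed-up: {per_stream_speedup:.1f}x real time")
```

Under that assumption each stream runs at roughly 15.6x real time, so the headline number describes batch capacity rather than how fast a single file finishes.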

Capabilities & Use Cases

Qwen3-ASR-0.6B is designed to be dropped into production speech pipelines with minimal integration overhead. Its core capabilities:

Language identification for 52 languages and dialects, including 30 languages (e.g., Chinese, English, Japanese, Arabic, German, French, Spanish, Portuguese, Hindi, Vietnamese, Thai, Korean, Russian, Turkish, Indonesian, Malay, Dutch, Swedish, Danish, Finnish, Polish, Czech, Filipino, Persian, Greek, Hungarian, Macedonian, Romanian) plus 22 Chinese dialects (e.g., Cantonese, Shanghainese, Sichuanese, Fujian, and more). English accents from multiple regions are also covered.
Speech-to-text with streaming output – ideal for live captioning, voice assistants, or real-time transcription.
Long audio transcription – can process hours of audio by chaining chunks.
Forced alignment support via the companion Qwen3-ForcedAligner-0.6B model (a separate download, but part of the same ecosystem) for word-level or phoneme-level timestamp prediction.

Concrete use cases:

On-device voice commands – run on a Raspberry Pi with an NPU or a laptop with 4GB RAM using quantization.
Low-latency multilingual transcription for call centers or live events (92ms TTFT means near-instant first word).
Batch transcription of recorded meetings or podcasts using consumer GPUs (RTX 3060 and up).
Embedded systems – the 0.6B size allows inference on many edge devices with a small GPU or accelerator.

Running Qwen3-ASR-0.6B Locally

This is where Qwen3-ASR-0.6B shines. Its small footprint makes it accessible on hardware that can’t touch larger models.

VRAM Requirements

Quantization	Estimated VRAM	Realistic Hardware
FP16 (unquantized)	~1.2 GB	Any GPU with 2GB+ VRAM
Q8_0 (8-bit)	~0.7 GB	Raspberry Pi 5 (no GPU), CPU inference
Q4_K_M (recommended)	~0.5 GB	Any modern GPU, integrated graphics
Q4_0	~0.4 GB	Extremely memory-constrained devices

For most users, Q4_K_M strikes the best balance: about 0.5 GB VRAM, usually minor quality degradation, and fast inference even on integrated GPUs.
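
The estimates in the table follow from parameter count times bits per weight. The sketch below uses nominal bit widths (16, 8, and 4), which is a simplifying assumption: quantized formats also store per-block scale data, and the runtime needs extra memory for activations and audio buffers. That is why the table's quantized figures sit above these weight-only floors.

```python
PARAMS = 0.6e9  # parameter count stated for Qwen3-ASR-0.6B

# Nominal bits per weight; real quantized files are somewhat larger.
NOMINAL_BITS = {"FP16": 16, "Q8_0": 8, "Q4_K_M": 4, "Q4_0": 4}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in NOMINAL_BITS.items():
        print(f"{name}: at least {weight_gb(PARAMS, bits):.2f} GB for weights")
```

Treat the output as a lower bound when sizing hardware, and leave headroom for the runtime itself.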

Hardware That Works

Consumer GPUs: RTX 3060 (12GB) can run the model at FP16 with headroom for audio preprocessing. RTX 4090 or 4080 will handle high-concurrency streaming (multiple channels) easily.
Apple Silicon: MacBook Air M1 (8GB unified memory) can run Q4_K_M comfortably. M4 Max can do FP16 for highest quality.
CPU-Only: With Q4_K_M, you can transcribe in near-real-time on a modern x86 CPU (e.g., Ryzen 7, Intel i7-13700). Expect ~0.5x real-time factor.
Edge Devices: NVIDIA Jetson Orin Nano (8GB), Raspberry Pi 5 (4GB) with Q4_0 can handle low-latency single-channel streaming.

Expected Performance (Speed Relative to Real Time)

Performance depends on audio length, streaming vs. batch, and quantization. On an RTX 4090 at Q4_K_M, expect:

Streaming mode: ~200-400x real time (1 second of audio transcribed in ~2.5-5ms)
Batch offline: up to 2000x (128 concurrent streams) as reported.

For a single stream on a mid-range GPU (RTX 3060, Q4_K_M), expect roughly 10-30x real time, meaning 5 seconds of audio is processed in about 0.17-0.5 seconds.
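
The figures in this section are speed-ups (audio seconds per wall-clock second). Real-time factor is also commonly quoted the other way round, as processing time divided by audio duration, where lower is better. A small sketch of both conversions, with illustrative helper names:

```python
def processing_seconds(audio_seconds: float, speedup: float) -> float:
    """Wall-clock time needed to transcribe audio at a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(speedup: float) -> float:
    """Processing time divided by audio duration (lower is faster)."""
    return processing_seconds(1.0, speedup)


if __name__ == "__main__":
    # Speed-up ranges quoted in this section.
    for label, speedup in [("3060 low", 10), ("3060 high", 30),
                           ("4090 low", 200), ("4090 high", 400)]:
        print(f"{label}: 5 s of audio in {processing_seconds(5, speedup):.3f} s, "
              f"RTF {real_time_factor(speedup):.4f}")
```

When comparing numbers from different sources, check which convention is in use, since a speed-up of 10x and a real-time factor of 0.1 describe the same performance.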

Getting Started

The fastest way to run Qwen3-ASR-0.6B locally is through Ollama. The model is available in the Ollama library as qwen3-asr:0.6b. Command:

ollama pull qwen3-asr:0.6b

For custom deployment, Alibaba provides an inference toolkit on GitHub with vLLM backend, streaming, and Gradio demos. You can also use Hugging Face transformers with the AutoModel pipeline.

How It Compares

Qwen3-ASR-0.6B sits at the intersection of size and capability. Its main competitors are other small ASR models:

Whisper Small (244M): Smaller and faster. Whisper is also multilingual and detects the spoken language itself, but it was designed for offline transcription of fixed-length windows rather than streaming, and it lacks Qwen3-ASR-0.6B’s Chinese dialect coverage. Qwen3-ASR-0.6B natively handles 52 languages and dialects with integrated language ID and supports both streaming and offline modes from the same weights. Whisper Small runs on less memory (~500 MB FP16).
SenseVoice Small (0.6B): A similar-sized dense ASR model from Alibaba’s FunAudioLLM team rather than the Qwen team. SenseVoice supports roughly 50 languages but has less extensive Chinese dialect coverage. Qwen3-ASR-0.6B matches it in size and often outperforms it on internal benchmarks for accuracy under noise and accented speech.

When to choose Qwen3-ASR-0.6B: You need a single model to handle multilingual ASR with language ID, streaming, and batch workflows on memory-constrained hardware. You want Apache 2.0 licensing without restrictions. You need strong Chinese dialect support.

When to look elsewhere: If you only transcribe English and your hardware is extremely limited (e.g., 256MB RAM), Whisper Tiny (39M) is smaller. If you need the absolute highest accuracy and have a powerful server, use the 1.7B variant or a commercial API.

Related Models

Qwen3-ASR-1.7B (Alibaba, 1.7B, Dense)
CosyVoice 2.0 (Alibaba, 0.5B, Dense)

