Alibaba

Qwen3-ASR-1.7B

Alibaba Qwen's flagship 1.7B-parameter ASR model supporting 52 languages and dialects, achieving SOTA performance among open-source ASR models and competitive with top proprietary APIs.

1.7B params · Dense


Our Take

Best for: Open-source ASR workloads

A strong 1.7B-parameter dense audio model from Alibaba. Treat the modality benchmarks below as the leading indicator of fit; composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 1.7B

Architecture: Dense

Provider: Alibaba

Download Size: 4.7 GB

Community

Monthly Downloads: 1.5M

Likes: 924

Last Updated: 5 months ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

WER: 5.8%

MBA Open Score: 72.4 (grade AA)

Component	Weight	Score
Benchmark	40%	88.5
Popularity	25%	89.6
Efficiency	25%	30.4
Versatility	10%	70.0
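
The composite score appears to be a weighted sum of the four component scores. The sketch below reproduces the published 72.4 under that assumption; the weighted-sum formula is inferred from the numbers on this page, not taken from a documented methodology.

```python
# Component weights and scores as listed on this page.
# Assumption: the composite is a plain weighted sum (inferred, not documented).
COMPONENTS = {
    "benchmark": (0.40, 88.5),
    "popularity": (0.25, 89.6),
    "efficiency": (0.25, 30.4),
    "versatility": (0.10, 70.0),
}


def composite_score(components):
    """Return the weighted sum of (weight, score) pairs."""
    return sum(weight * score for weight, score in components.values())


score = composite_score(COMPONENTS)
print(f"{score:.1f}")  # prints 72.4
```

Read this way, the low Efficiency score (30.4) is what pulls the composite well below the Benchmark and Popularity components.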

Hardware That Runs This Model

The top devices for this model at 4-bit, ranked by fit and speed.

Device	Vendor	Grade	VRAM
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	1.5 GB
Acer Veriton GN100 AI Mini	Acer	SS	1.5 GB
AMD Instinct MI300X	AMD	SS	1.5 GB
AMD Instinct MI325X	AMD	SS	1.5 GB
AMD Instinct MI355X	AMD	SS	1.5 GB

See All 102 Compatible Devices

Rent in the Cloud

Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.

Option	Provider · Tier · VRAM	Cost / GPU-hour
NVIDIA GeForce RTX 5060 Ti	Vast.ai · Spot · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5060 Ti	Vast.ai · On-Demand · 16 GB VRAM	$0.09
NVIDIA GeForce RTX 5070	Vast.ai · Spot · 12 GB VRAM	$0.09
NVIDIA GeForce RTX 5070	Vast.ai · On-Demand · 12 GB VRAM	$0.10
NVIDIA GeForce RTX 5070 Ti	Vast.ai · Spot · 16 GB VRAM	$0.11

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
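
One way to fold restarts into the comparison is to inflate the spot rate by the fraction of time lost to interruptions. The sketch below uses the RTX 5070 rates listed above; the overhead model (a single multiplicative fraction) is a simplifying assumption, not how either marketplace bills.

```python
def effective_cost(rate_per_hour, useful_hours, overhead_fraction=0.0):
    """Cost of a job needing `useful_hours` of compute when an extra
    fraction of time is lost to interruptions and restarts."""
    return rate_per_hour * useful_hours * (1.0 + overhead_fraction)


def break_even_overhead(spot_rate, on_demand_rate):
    """Overhead fraction at which spot and on-demand cost the same."""
    return on_demand_rate / spot_rate - 1.0


SPOT_5070 = 0.09       # $/GPU-hour, from the table above
ON_DEMAND_5070 = 0.10  # $/GPU-hour, from the table above

threshold = break_even_overhead(SPOT_5070, ON_DEMAND_5070)
print(f"spot stops being cheaper above {threshold:.1%} lost time")
```

At these prices the margin is thin: if restarts cost more than about a ninth of the job's runtime, on-demand is the cheaper option.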

See the full price index

About This Model

Overview

Alibaba’s Qwen3-ASR-1.7B is a dense, 1.7-billion-parameter automatic speech recognition model that the Qwen team positions as state of the art among open-source ASR models. It supports language identification and transcription across 30 languages and 22 Chinese dialects (52 language variants in total), making it one of the most multilingual open-source ASR models available. The Qwen team at Alibaba Cloud built it on top of the Qwen3-Omni audio understanding foundation, then trained it on large-scale speech data; per their own benchmarks, it matches or exceeds top proprietary APIs from cloud providers.

For practitioners evaluating local AI models, the 1.7B parameter count places Qwen3-ASR in a sweet spot: small enough to run on a single consumer GPU (with quantization on smaller cards), yet large enough to approach cloud-grade accuracy. It is released under Apache 2.0, which permits commercial use, modification, and redistribution subject to the license's notice and attribution terms. The companion inference toolkit (streaming, vLLM batch inference, and async serving) makes it a serious candidate for production deployments, not just research use.

Architecture & Technical Details

Qwen3-ASR-1.7B uses a two-stage pipeline. An audio encoder (AuT) downsamples the input (16 kHz WAV audio or mel spectrograms) through three Conv2D layers, then passes it through a 24-layer transformer encoder with 16 attention heads, a model dimension of 1024, and an FFN dimension of 4096. The 2048-dimensional encoder output is then projected into a standard Qwen3 decoder: 28 layers, hidden size 2048, 16 attention heads with 8 KV heads, and a vocabulary of 151,936 tokens.
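
The decoder figures above are enough to estimate the KV-cache cost per generated token. The following is a back-of-the-envelope sketch, not the model's actual configuration class: `DecoderConfig` is a hypothetical name, and the head dimension is assumed to be hidden size divided by attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecoderConfig:
    """Decoder hyperparameters as quoted in the text above."""
    num_layers: int = 28
    hidden_size: int = 2048
    num_attention_heads: int = 16
    num_kv_heads: int = 8
    vocab_size: int = 151_936

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_attention_heads.
        return self.hidden_size // self.num_attention_heads


def kv_cache_bytes_per_token(cfg, bytes_per_value=2):
    """Keys and values (factor 2) are cached for every layer and KV head."""
    return 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * bytes_per_value


cfg = DecoderConfig()
per_token = kv_cache_bytes_per_token(cfg)  # FP16 values
print(per_token, "bytes per cached token")  # 114688 bytes, i.e. 112 KiB
```

Because the decoder has 8 KV heads for 16 query heads (grouped-query attention), the cache is half the size it would be with one KV head per query head.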

This dense architecture means all 1.7 billion parameters are active during inference. Unlike mixture-of-experts models, where only a subset of parameters activates per token, every weight in Qwen3-ASR participates in every decoding step. The tradeoff: memory use is predictable and follows directly from parameter count and precision, but compute per token scales with the full model size. For local deployment, the full set of weights must fit in GPU memory, either at full precision or after quantization.

The decoder uses Q/K norms and Multimodal Rotary Position Embedding (MRoPE) to handle variable-length audio inputs. No context length is specified. Long recordings are handled by chunking in the streaming pipeline, so the length of a recording is not bounded by a single context window.
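
To make the chunking idea concrete, here is a minimal sketch of splitting a long recording into overlapping windows. It is not the official pipeline: the 30-second chunk length and 1-second overlap are illustrative values chosen for this example, and `chunk_boundaries` is a hypothetical helper.

```python
def chunk_boundaries(num_samples, sample_rate=16_000,
                     chunk_seconds=30.0, overlap_seconds=1.0):
    """Return (start, end) sample indices that cover the whole recording."""
    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)
    if step <= 0:
        raise ValueError("overlap must be shorter than the chunk length")
    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the recording
        start += step
    return bounds


# A 70-second recording at 16 kHz yields three overlapping chunks.
bounds = chunk_boundaries(70 * 16_000)
print(bounds)
```

The overlap gives each chunk some shared audio with its neighbor, which a real pipeline can use to stitch transcripts without cutting words at the boundary.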

Capabilities & Use Cases

Qwen3-ASR-1.7B is an audio-in, text-out model: it takes audio and outputs text. Its primary capability is speech-to-text with built-in language identification. The trained languages include Chinese, English, Cantonese, Arabic, German, French, Spanish, Portuguese, Indonesian, Italian, Korean, Russian, Thai, Vietnamese, Japanese, Turkish, Hindi, Malay, Dutch, Swedish, Danish, Finnish, Polish, Czech, Filipino, Persian, Greek, Hungarian, Macedonian, and Romanian. The 22 Chinese dialects cover Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, and others.

Use cases that benefit specifically from this model:

Multilingual call center transcription: One model handles 52 language variants with no model switching, and it identifies the language automatically per utterance.
Real-time captioning for live streams or meetings: The streaming inference mode reports a time-to-first-token as low as 92 ms on the 0.6B sibling; expect the 1.7B to be somewhat slower, though likely still sub-second.
Offline batch transcription of long audio: The vLLM backend can process large audio files efficiently, and the model can be deployed on a single GPU for on-premises compliance.
Singing voice recognition: The Qwen team's internal evaluations show robust performance on music/song audio, a differentiator from many ASR models that degrade on sung vocals.
Forced alignment (companion model): Qwen3-ForcedAligner-0.6B, a separate non-autoregressive timestamp predictor, can align text–speech pairs in 11 languages with sub-second accuracy, which is useful for subtitle timing or phonetic analysis.
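
For the subtitle-timing use case, aligned segments eventually need to be written in a subtitle format. The sketch below renders segments as SRT; the `(start, end, text)` tuples are hypothetical sample data, and the aligner's real output format may differ.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Render (start_s, end_s, text) tuples as an SRT document."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical aligned segments, for illustration only.
segments = [(0.0, 1.84, "Hello and welcome."), (2.1, 4.375, "Let's get started.")]
print(to_srt(segments))
```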

Running Qwen3-ASR-1.7B Locally

To run Qwen3-ASR-1.7B on your own hardware, you need to consider quantization, VRAM, and GPU generation. The model is available on Hugging Face (Qwen/Qwen3-ASR-1.7B) and supports the standard inference toolchain including transformers, vLLM, and an official Python inference script.

VRAM requirements by quantization

Quantization	Minimum VRAM	Recommended VRAM	Typical hardware
FP16 (full)	~3.6 GB	6+ GB	RTX 3060 12GB, M4 Max, RTX 4090, A6000
Q4_K_M (GGUF)	~1.2 GB	2 GB	RTX 3060 12GB, M4 Pro, Steam Deck (limited)
Q8_0 (GGUF)	~2.0 GB	3 GB	RTX 3060 12GB, RTX 4060
AWQ (4-bit)	~1.0 GB	2 GB	Same as Q4_K_M, slightly better performance

A practical rule: Q4_K_M quantization is the default recommendation for most users. It costs roughly 1–2% in WER on benchmark tests but cuts memory to about a third of FP16 (per the table above) and speeds up decoding on memory-bandwidth-constrained GPUs.
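
The table's minimums can be sanity-checked from parameter count and bits per weight. This is a rough sketch covering weights only: the Q8_0 figure follows from its block format, while the Q4_K_M average is an assumed approximation, and runtime overhead (activations, KV cache, framework buffers) is ignored.

```python
PARAMS = 1.7e9  # parameter count quoted on this page

# Approximate storage cost per weight, in bits.
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,     # 8-bit values plus one FP16 scale per 32-weight block
    "Q4_K_M": 4.85,  # assumed average for this mixed-precision scheme
}


def weight_gb(params, bits_per_weight):
    """Size of the weights alone, in decimal gigabytes."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{weight_gb(PARAMS, bits):.2f} GB")
```

The results land slightly below the table's minimum-VRAM column, which is consistent with the table including some runtime overhead on top of the weights.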

Consumer GPU performance estimates

Performance numbers depend heavily on audio length, batch size, and quantization.

RTX 4090 (24GB): At FP16, you can expect around 200–300 tokens per second on short utterances (under 10 seconds) with batch size 1. This translates to transcribing roughly 100–150 seconds of audio per second of wall time. With Q4_K_M, throughput can exceed 500 tokens per second.
RTX 3060 12GB: The 12GB of VRAM fits FP16 weights with room to spare (about 3.6 GB minimum per the table above), so Q4_K_M is optional here and mainly useful for speed. Expect 100–150 tokens per second.
M4 Max (128GB unified memory): FP16 weights load cleanly. Throughput is similar to an RTX 4090 but with lower memory contention due to unified architecture.
Apple M4 Pro / M3 Pro: Use Q4_K_M to fit comfortably. Expect 80–120 tokens per second.
Steam Deck or low-power machines: Only feasible with Q4_K_M and small batch sizes. Expect 30–50 tokens per second—usable for short clips but painful for long audio.
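
The conversion from tokens per second to audio seconds per wall-clock second depends on how many text tokens a second of speech produces. The RTX 4090 figures above imply about 2 tokens per audio second; the sketch below makes that assumption explicit. It models decoding only and ignores encoder time.

```python
def audio_seconds_per_wall_second(tokens_per_second, tokens_per_audio_second):
    """Speed-up over real time for a decode-bound transcription."""
    return tokens_per_second / tokens_per_audio_second


# Assumption: about 2 text tokens per second of speech, implied by the
# RTX 4090 estimate above rather than measured.
TOKENS_PER_AUDIO_SECOND = 2.0

for tps in (200, 300):
    speedup = audio_seconds_per_wall_second(tps, TOKENS_PER_AUDIO_SECOND)
    print(f"{tps} tok/s -> {speedup:.0f}x real time")
```

Under the same assumption, the 30–50 tokens per second estimated for low-power machines would still be roughly 15–25x real time, so the "painful for long audio" caveat mostly reflects factors this simple model leaves out.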

Quickest start

The fastest way to get Qwen3-ASR-1.7B running locally is via Ollama (if a GGUF conversion is available) or directly with the official qwen_asr Python package. Example using the inference script:

pip install qwen_asr
python -m qwen_asr.transcribe --audio /path/to/audio.wav --model Qwen/Qwen3-ASR-1.7B
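
Since the encoder is described as working on 16 kHz audio, it can help to check a file's sample rate before transcribing. The sketch below uses only Python's standard `wave` module; `check_wav` and `write_silence` are hypothetical helpers, and whether the toolkit resamples input automatically is not stated here.

```python
import os
import tempfile
import wave


def check_wav(path, expected_rate=16_000):
    """Return (ok, sample_rate, channels, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate == expected_rate, rate, channels, duration


def write_silence(path, seconds=1.0, rate=16_000):
    """Write a mono 16-bit silent WAV, used here only to demo check_wav."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path)
    result = check_wav(demo_path)

print(result)  # (True, 16000, 1, 1.0)
```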

For production, the vLLM backend provides robust streaming and batching. The official GitHub repository (QwenLM/Qwen3-ASR) includes Docker images and deployment examples.

How It Compares

In the open-source ASR landscape, the two closest competitors at similar parameter counts are Whisper large-v3 (1.5B parameters) and SeamlessM4T-v2 (2.3B parameters, but includes translation). Here’s how Qwen3-ASR-1.7B stacks up:

Whisper large-v3: Whisper supports roughly 100 languages, but per the Qwen team's own benchmarks its performance on Chinese dialects and in noisy environments lags behind Qwen3-ASR. Qwen3-ASR also offers native streaming and a companion forced-alignment model, which Whisper does not. Whisper’s encoder operates on a fixed 30-second window; Qwen3-ASR handles long audio through its streaming pipeline instead of a fixed window.
SeamlessM4T-v2: Larger and slower to run locally. Its strength is speech-to-speech translation, not pure ASR. If you only need transcription, Qwen3-ASR is more efficient and reports better WER on multilingual benchmarks.
OpenAI Whisper API (proprietary): Qwen3-ASR-1.7B is competitive with the proprietary API on common benchmarks, but corner-case performance on very noisy audio or rare dialects may still favor the API. For local, low-latency deployment with no per-request fees, Qwen3-ASR is the stronger choice.

Choose Qwen3-ASR-1.7B when you need a single model that covers 52 languages/dialects with streaming, requires no cloud dependency, and can be quantized to run on a mid-range consumer GPU. If your workload is strictly English-only and you already have a Whisper pipeline, the gap is smaller—but Qwen3-ASR’s integrated language identification and forced alignment options often make it worth the switch.
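
When comparing models on your own audio, WER is the number of word-level edits (substitutions, deletions, insertions) divided by the number of reference words. The following is a minimal sketch with no text normalization; published benchmarks typically normalize case and punctuation first, so results will not match theirs exactly.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.1%}")  # 33.3%
```

For languages written without spaces, such as Chinese, the same function can be applied to space-separated characters to get a character error rate instead.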

Related Models

Qwen3-ASR-0.6B (Alibaba): 0.6B, Dense

CosyVoice 2.0 (Alibaba): 0.5B, Dense

