
Alibaba Qwen's flagship 1.7B-parameter ASR model supporting 52 languages and dialects, achieving SOTA performance among open-source ASR models and competitive with top proprietary APIs.
A strong 1.7B-parameter dense audio model from Alibaba. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 2 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 3060 Ti · Vast.ai · Spot · 8 GB VRAM | $0.04 |
| NVIDIA GeForce RTX 3060 Ti · Vast.ai · On-Demand · 8 GB VRAM | $0.04 |
Alibaba’s Qwen3-ASR-1.7B is a dense, 1.7-billion-parameter automatic speech recognition model positioned at the top of the open-source ASR field. It supports language identification and transcription across 30 languages and 22 Chinese dialects (52 language variants in total), making it one of the most multilingual open-source ASR models available. The Qwen team at Alibaba Cloud built it on the Qwen3-Omni audio understanding foundation and trained it on large-scale speech data; by the team’s own benchmarks, it matches or exceeds top proprietary cloud ASR APIs.
For practitioners evaluating local AI models, the 1.7B parameter count places Qwen3-ASR in a sweet spot: small enough to run on a single consumer GPU, especially with quantization, yet large enough to rival cloud-grade accuracy. It is released under Apache 2.0, a permissive license that allows commercial use subject to its attribution and notice requirements. The companion inference toolkit (streaming, vLLM batch inference, and async serving) makes it a serious contender for production deployments, not just a research toy.

Qwen3-ASR-1.7B uses a two-stage pipeline. First, an audio encoder (AuT) downsamples 16 kHz WAV or mel-spectrogram input through three Conv2D layers and passes the result through a 24-layer transformer encoder with 16 attention heads, a model dimension of 1024, and an FFN dimension of 4096. Second, the encoder output (2048-dimensional) is projected into a standard Qwen3 decoder with 28 layers, a hidden size of 2048, 16 attention heads with 8 KV heads, and a vocabulary of 151,936 tokens.
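The dimensions quoted above imply a few derived quantities that matter for memory planning. The sketch below only restates those numbers; the class and field names are illustrative, not the model's actual configuration keys, and it assumes the per-head dimension equals the hidden size divided by the number of attention heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    # Values quoted in the text for the AuT audio encoder.
    num_layers: int = 24
    num_heads: int = 16
    d_model: int = 1024
    ffn_dim: int = 4096

    @property
    def head_dim(self) -> int:
        # Assumption: heads split the model dimension evenly.
        return self.d_model // self.num_heads


@dataclass(frozen=True)
class DecoderConfig:
    # Values quoted in the text for the Qwen3 decoder.
    num_layers: int = 28
    hidden_size: int = 2048
    num_attention_heads: int = 16
    num_kv_heads: int = 8
    vocab_size: int = 151_936

    @property
    def head_dim(self) -> int:
        # Assumption: head_dim = hidden_size / num_attention_heads.
        return self.hidden_size // self.num_attention_heads

    @property
    def queries_per_kv_head(self) -> int:
        # Grouped-query attention: how many query heads share one KV head.
        return self.num_attention_heads // self.num_kv_heads

    def kv_cache_bytes_per_token(self, bytes_per_value: int = 2) -> int:
        # One K and one V vector per layer and per KV head.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * bytes_per_value


encoder = EncoderConfig()
decoder = DecoderConfig()
print(encoder.head_dim)                    # 64
print(decoder.head_dim)                    # 128
print(decoder.queries_per_kv_head)         # 2
print(decoder.kv_cache_bytes_per_token())  # 114688 bytes (112 KiB) at FP16
```

Under these assumptions, sharing each KV head between two query heads halves the decoder's KV cache compared with giving all 16 heads their own keys and values.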
This dense architecture means all 1.7 billion parameters are active during inference. Unlike mixture-of-experts models, where only a subset of parameters activates per token, Qwen3-ASR runs every token through the same weights, so there is no expert routing to vary compute or behavior from one input to the next. The tradeoff is that compute per token scales with the full parameter count rather than with a smaller active subset. For local deployment, all weights must fit in memory; at 1.7B parameters that footprint is modest, and quantization reduces it further for memory-constrained GPUs.
The decoder uses Q/K norms and Multimodal Rotary Position Embedding (MRoPE) to handle variable-length audio inputs. A context length is not specified, but the streaming pipeline handles long audio by chunking, so a recording of arbitrary duration can be transcribed without the whole file having to fit in a single context window.
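The general idea behind chunked transcription can be sketched as splitting a long recording into fixed-size, slightly overlapping windows. The 30-second window and 1-second overlap below are illustrative assumptions; the source does not describe how the official pipeline picks its chunk boundaries, and `chunk_bounds` is a hypothetical helper, not part of the toolkit.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000  # 16 kHz input, as stated in the text


def chunk_bounds(
    num_samples: int,
    chunk_s: float = 30.0,    # assumed window length, not from the source
    overlap_s: float = 1.0,   # assumed overlap, not from the source
    sample_rate: int = SAMPLE_RATE,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices covering the whole recording."""
    chunk = int(chunk_s * sample_rate)
    overlap = int(overlap_s * sample_rate)
    if chunk <= 0 or not 0 <= overlap < chunk:
        raise ValueError("need chunk > 0 and 0 <= overlap < chunk")

    step = chunk - overlap
    bounds: List[Tuple[int, int]] = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # final (possibly shorter) chunk reached
        start += step
    return bounds


# A 70-second recording splits into three windows.
for start, end in chunk_bounds(70 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:.0f}s to {end / SAMPLE_RATE:.0f}s")
```

Each window can then be transcribed independently and the overlapping text merged; how that merge is done is left out here because it depends on the pipeline.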
Qwen3-ASR-1.7B is an audio-to-text model: it takes audio as input and outputs text. Its primary capability is speech-to-text with built-in language identification. The trained languages include Chinese, English, Cantonese, Arabic, German, French, Spanish, Portuguese, Indonesian, Italian, Korean, Russian, Thai, Vietnamese, Japanese, Turkish, Hindi, Malay, Dutch, Swedish, Danish, Finnish, Polish, Czech, Filipino, Persian, Greek, Hungarian, Macedonian, and Romanian. The 22 Chinese dialects cover Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, and others.
Use cases that benefit specifically from this model:
To run Qwen3-ASR-1.7B on your own hardware, you need to consider quantization, VRAM, and GPU generation. The model is available on Hugging Face (Qwen/Qwen3-ASR-1.7B) and supports the standard inference toolchain including transformers, vLLM, and an official Python inference script.
| Quantization | Minimum VRAM | Recommended VRAM | Typical hardware |
|---|---|---|---|
| FP16 (full) | ~3.6 GB | 6+ GB | RTX 3060 12GB, M4 Max, RTX 4090, A6000 |
| Q4_K_M (GGUF) | ~1.2 GB | 2 GB | RTX 3060 12GB, M4 Pro, Steam Deck (limited) |
| Q8_0 (GGUF) | ~2.0 GB | 3 GB | RTX 3060 12GB, RTX 4060 |
| AWQ (4-bit) | ~1.0 GB | 2 GB | Same as Q4_K_M, slightly better performance |
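The minimum-VRAM column can be sanity-checked with a weights-only estimate: parameter count times bits per weight. In this rough sketch, the FP16 figure is exact, while the bits-per-weight values for the two GGUF formats are commonly cited approximations and should be treated as assumptions. AWQ is omitted because its effective bits per weight depend on the group size.

```python
# Parameter count from the text; treated as exactly 1.7e9 for illustration.
PARAMS = 1.7e9

# Approximate average bits per weight. FP16 is exact; the GGUF values are
# commonly cited approximations (assumptions here, not measured figures).
BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


estimates = {name: weight_gb(PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```

This gives roughly 3.4 GB, 1.8 GB, and 1.0 GB. The table's figures sit slightly above these, which is plausible once runtime overhead, activations, and the KV cache are added on top of the weights.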
A practical rule: Q4_K_M quantization is the default recommendation for most users. It costs roughly 1–2% WER on benchmark tests, but per the table above it cuts weight memory to about a third of FP16 (~1.2 GB versus ~3.6 GB) and speeds up decoding on memory-bandwidth-constrained GPUs.
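To check the accuracy cost of a quantized build on your own recordings, word error rate (WER) can be computed as the word-level edit distance between a reference transcript and the model output, divided by the reference length. The following is a minimal sketch with whitespace tokenization and no text normalization; published benchmarks usually normalize casing and punctuation first, so absolute numbers will differ.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(wer("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

Running the same audio through the FP16 and Q4_K_M builds and comparing their WER against a trusted transcript shows whether the accuracy loss is acceptable for a given workload.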
Performance numbers depend heavily on audio length, batch size, and quantization.
The fastest way to get Qwen3-ASR-1.7B running locally is via Ollama (if a GGUF conversion is available) or directly with the official qwen_asr Python package. Example using the inference script:

    pip install qwen_asr
    python -m qwen_asr.transcribe --audio /path/to/audio.wav --model Qwen/Qwen3-ASR-1.7B

For production, the vLLM backend provides robust streaming and batching. The official GitHub repository (QwenLM/Qwen3-ASR) includes Docker images and deployment examples.
In the open-source ASR landscape, the two closest competitors at similar parameter counts are Whisper large-v3 (1.5B parameters) and SeamlessM4T-v2 (2.3B parameters, but includes translation). Here’s how Qwen3-ASR-1.7B stacks up:
Choose Qwen3-ASR-1.7B when you need a single model that covers 52 languages and dialects with streaming, runs without a cloud dependency, and can be quantized to fit a mid-range consumer GPU. If your workload is strictly English-only and you already have a Whisper pipeline, the gap is smaller, but Qwen3-ASR’s integrated language identification and forced alignment options often make it worth the switch.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 3070 · Vast.ai · On-Demand · 8 GB VRAM | $0.04 |
| NVIDIA GeForce RTX 5060 Ti · Vast.ai · Spot · 16 GB VRAM | $0.08 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
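One way to fold restarts into the comparison is to convert the hourly rate into a cost per hour of audio transcribed. The sketch below uses the $0.04 rate from the listing; the throughput (20 audio hours per GPU-hour) and the 10% of spot time lost to restarts are placeholder assumptions to be replaced with your own measurements.

```python
def cost_per_audio_hour(
    rate_per_gpu_hour: float,
    audio_hours_per_gpu_hour: float,
    restart_overhead: float = 0.0,
) -> float:
    """Dollars per hour of audio, given the fraction of GPU time lost to restarts."""
    if audio_hours_per_gpu_hour <= 0:
        raise ValueError("throughput must be positive")
    if not 0 <= restart_overhead < 1:
        raise ValueError("restart_overhead must be in [0, 1)")
    useful_fraction = 1 - restart_overhead
    return rate_per_gpu_hour / (audio_hours_per_gpu_hour * useful_fraction)


THROUGHPUT = 20.0  # hypothetical: audio hours transcribed per GPU-hour

on_demand = cost_per_audio_hour(0.04, THROUGHPUT)
spot = cost_per_audio_hour(0.04, THROUGHPUT, restart_overhead=0.10)  # assumed 10% lost
print(f"on-demand: ${on_demand:.4f} per audio hour")
print(f"spot:      ${spot:.4f} per audio hour")
```

When the spot and on-demand rates are equal, as in the listing above, any restart overhead makes spot the more expensive option per hour of audio.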