
Z.ai's compact 1.5B-parameter open-source ASR model from the GLM family, optimized for real-world conditions, including Chinese dialects (notably Cantonese) and whispered or quiet speech, and reported to outperform Whisper V3 on several benchmarks.
A solid 1.5B-parameter dense audio model from Z.ai. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| GPU | Provider | Tier | VRAM | Cost / GPU-hour |
|---|---|---|---|---|
| NVIDIA GeForce RTX 3080 | Vast.ai | Spot | 10 GB | $0.03 |
| NVIDIA GeForce RTX 3080 | Vast.ai | On-Demand | 10 GB | $0.03 |
| NVIDIA GeForce RTX 5060 Ti | Vast.ai | Spot | 16 GB | $0.09 |
| NVIDIA GeForce RTX 5090 | Vast.ai | Spot | 32 GB | $0.09 |
| NVIDIA GeForce RTX 5060 Ti | Vast.ai | On-Demand | 16 GB | $0.10 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
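To make the spot versus on-demand comparison concrete, here is a minimal sketch of the arithmetic. The two RTX 5060 Ti rates come from the table; the `interruption_overhead` parameter (the fraction of paid time lost to restarts) is a hypothetical input you would have to measure yourself.

```python
def effective_spot_rate(spot_rate: float, interruption_overhead: float) -> float:
    """Cost per hour of useful work on an interruptible instance.

    interruption_overhead is the assumed fraction of paid time lost to
    restarts (reloading weights, redoing interrupted work). It is not a
    figure from the pricing table.
    """
    if not 0.0 <= interruption_overhead < 1.0:
        raise ValueError("interruption_overhead must be in [0, 1)")
    return spot_rate / (1.0 - interruption_overhead)


spot = 0.09       # RTX 5060 Ti, spot tier, from the table
on_demand = 0.10  # RTX 5060 Ti, on-demand tier, from the table

# Overhead fraction at which spot stops being cheaper than on-demand.
breakeven = 1.0 - spot / on_demand

print(f"break-even overhead: {breakeven:.0%}")
print(f"effective spot rate at 5% overhead: ${effective_spot_rate(spot, 0.05):.4f}/h")
```

With rates this close together, a small amount of lost time is enough to erase the spot discount, which is why the restart caveat matters.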
GLM-ASR-Nano-2512 is a compact, open-source automatic speech recognition (ASR) model from Z.ai, weighing in at 1.5 billion parameters. It belongs to the GLM family and targets a specific pain point: accurate transcription in real-world conditions that trip up larger, more general-purpose models. The model is a dense architecture—no mixture-of-experts tricks—meaning all 1.5B parameters are active during inference. That makes it straightforward to run on consumer hardware without the complexity of managing expert routing.
What sets this model apart is its focus on Chinese dialects (particularly Cantonese) and its ability to handle whisper-quiet speech: audio so low-volume that most ASR systems return silence or garbage. On standard benchmarks like Wenet Meeting (noisy, overlapping meeting speech) and Aishell-1 (Mandarin), it achieves the lowest average error rate (4.10) among comparable open-source models, outperforming OpenAI's Whisper V3, which has a similar parameter count (about 1.5B for large-v3). Licensed under MIT, it's free to use, modify, and deploy.
GLM-ASR-Nano-2512 uses a dense encoder-decoder architecture with 1.5B total parameters. Unlike MoE models where only a subset of parameters activate per token, this model loads and runs all parameters at inference time. The tradeoff is straightforward: you get predictable memory usage and consistent latency, but you pay the full VRAM cost for every forward pass.
Z.ai has not published a context length, so any specific token figure (such as the 2048 or 4096 defaults common in transformers configurations) should be treated as an assumption rather than a specification. In practice, the model processes audio frames directly through a processor that handles resampling and feature extraction, so the effective context is dictated by audio duration rather than text tokens. For most real-world use cases (meetings, interviews, short recordings), this should be sufficient, though very long recordings may need to be split before transcription.
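Since the limit is expressed in audio duration, one common way to stay within it is to split long recordings into fixed-length windows and transcribe each one separately. The sketch below shows only the splitting step. The 16 kHz sample rate and the 30-second window are assumptions chosen for illustration, not published limits for this model, and `split_audio` is a hypothetical helper.

```python
SAMPLE_RATE = 16_000   # assumption: a common ASR sampling rate, not confirmed for this model
CHUNK_SECONDS = 30.0   # assumption: arbitrary window length chosen for illustration


def split_audio(samples: list[float],
                sample_rate: int = SAMPLE_RATE,
                chunk_seconds: float = CHUNK_SECONDS) -> list[list[float]]:
    """Split a mono waveform into consecutive chunks of at most chunk_seconds."""
    chunk_len = int(sample_rate * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk length must be positive")
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]


# 75 seconds of silence stands in for a real recording.
audio = [0.0] * (75 * SAMPLE_RATE)
chunks = split_audio(audio)
durations = [len(chunk) / SAMPLE_RATE for chunk in chunks]
print(durations)
```

Hard cuts like this can split a word in half. Cutting at silences or overlapping adjacent windows are the usual refinements, at the cost of extra bookkeeping when merging transcripts.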
The model is designed to be run with Hugging Face transformers (installed from source at the time of writing), and support for vLLM and SGLang is planned for optimized inference. The processor includes an apply_transcription_request method that wraps audio preprocessing and chat template formatting, which keeps integration simple.
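GLM-ASR-Nano-2512 excels in three specific areas: Chinese dialects (particularly Cantonese), whispered or low-volume speech, and noisy meeting audio.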
Beyond these strengths, it supports 17 languages with WER ≤20%, including English, Japanese, Korean, and several European languages. However, its primary advantage is in Chinese-language tasks. Use it for voice assistants, dictation software, accessibility tools, and any application where local, low-latency ASR is required and cloud APIs are not an option.
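For readers who want to check a figure like "WER ≤20%" on their own recordings, word error rate is the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function below is a minimal sketch of that standard definition with no text normalization; it is not the evaluation script used for the published numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("lights" -> "light") and one deletion ("please") over 5 words.
wer = word_error_rate("turn the lights off please", "turn the light off")
print(f"WER: {wer:.2f}")
```

Chinese text has no spaces between words, so Chinese benchmarks are usually scored per character instead. The same function gives a character error rate if you pass strings with a space inserted between every character.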
This is where the model shines for practitioners. At 1.5B parameters, it fits comfortably on consumer GPUs without quantization: the weights alone take roughly 3 GB in 16-bit precision (1.5B parameters × 2 bytes), before activations and framework overhead.
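As a rough check on the hardware claim, weight memory can be estimated from the parameter count and the bytes used per parameter. This sketch covers weights only; activations, decoding caches, and framework overhead come on top and are not modeled here. The dtype table is a general assumption about storage sizes, not something specific to this model.

```python
# Storage size per parameter for common precisions.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


NUM_PARAMS = 1.5e9  # parameter count stated for GLM-ASR-Nano-2512

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype:>8}: {weight_memory_gib(NUM_PARAMS, dtype):.2f} GiB")
```

Even at full 32-bit precision the weights stay under 6 GiB, which is consistent with the 10 GB cards in the rental table having headroom to spare.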
The quickest way to start would be Ollama: pull the model once it is available in the Ollama library, or convert the Hugging Face checkpoint to GGUF format and import it. Otherwise, use transformers from source, as shown in the model card:

    from transformers import AutoModelForSeq2SeqLM, AutoProcessor

    # Load the processor (audio preprocessing + chat template) and the model weights.
    processor = AutoProcessor.from_pretrained("zai-org/GLM-ASR-Nano-2512")
    model = AutoModelForSeq2SeqLM.from_pretrained("zai-org/GLM-ASR-Nano-2512", dtype="auto", device_map="auto")

    # Build model inputs from an audio file, then move them to the model's device and dtype.
    inputs = processor.apply_transcription_request("audio.mp3")
    inputs = inputs.to(model.device, dtype=model.dtype)

    # Greedy decoding; drop the prompt tokens before decoding the transcript.
    outputs = model.generate(**inputs, do_sample=False, max_new_tokens=500)
    print(processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True))
No cloud dependency. No API keys. Just local inference.
The most direct competitor is OpenAI Whisper V3 (large-v3, about 1.5B parameters). On standard English benchmarks, Whisper V3 holds a slight edge. But on Chinese dialects, quiet speech, and noisy meetings, GLM-ASR-Nano-2512 consistently wins. Whisper V3 works on fixed 30-second audio windows and supports about 99 languages, but its performance on low-volume audio is mediocre. If you need broad multilingual coverage and can tolerate higher error rates on edge cases, Whisper V3 is the safer bet. If your primary language is Chinese (especially Cantonese) or you deal with whispered input, GLM-ASR-Nano-2512 is the better choice.
Another alternative is SenseVoice (by Alibaba), which also targets Chinese ASR but is larger (around 1.8B parameters) and less optimized for quiet speech. GLM-ASR-Nano-2512 matches or beats it on benchmarks while being easier to run locally due to its smaller size.
For English-only workloads with clean audio, none of these models is ideal; use a dedicated English ASR model such as Whisper small.en or Wav2Vec2. But for the niche this model fills, it's the most capable open-source option available today.