
A 0.5B-parameter LLM-based streaming multilingual zero-shot TTS system by Alibaba's FunAudioLLM group.
A solid 0.5B-parameter dense audio model from Alibaba. Treat the modality benchmarks above as the leading indicator of fit — composite scoring across modalities is still maturing.
Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.
| Option | Cost / GPU-hour |
|---|---|
| NVIDIA GeForce RTX 5070 Ti (Vast.ai · Spot · 16 GB VRAM) | $0.11 |
| NVIDIA GeForce RTX 3070 (RunPod · Community · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 3070 (RunPod · Spot · 8 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 5090 (Vast.ai · Spot · 32 GB VRAM) | $0.13 |
| NVIDIA GeForce RTX 4090 (Vast.ai · Spot · 24 GB VRAM) | $0.13 |
Per-GPU rate across RunPod and the Vast.ai marketplace.
Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.
CosyVoice 2.0 is a streaming, multilingual, zero-shot text-to-speech system developed by Alibaba’s FunAudioLLM group. At 0.5B parameters, it occupies a unique niche: a small-footprint TTS model that doesn’t sacrifice quality for latency. Unlike larger general-purpose LLMs that can produce speech via multimodal extensions, CosyVoice 2.0 is purpose-built for synthesis, with its architecture optimized for first-packet latency as low as 150ms and human-comparable naturalness.
It’s not a chatbot or a general language model; it’s a speech synthesis engine. The model accepts text plus a short voice sample for zero-shot cloning and outputs speech in nine languages, with fine-grained control over emotion, dialect, and speaking style. The Apache 2.0 license permits use in commercial products.
CosyVoice 2.0 uses a dense 0.5B parameter architecture, meaning all parameters are active during inference. There are no mixture-of-experts (MoE) routing tricks—you get predictable memory usage and inference speed. The model comprises three main components:
Context length is not specified by the provider, but in practice the model handles sentences up to several dozen words comfortably. For longer texts, you can feed it incrementally thanks to its streaming support.
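One way to feed longer texts incrementally is to split them into sentence-aligned chunks before synthesis. The sketch below is a minimal illustration using only the standard library; `chunk_text` and its `max_words` limit are hypothetical names chosen for this example and are not part of CosyVoice.

```python
import re


def chunk_text(text, max_words=40):
    """Split text into sentence-aligned chunks of at most max_words words.

    Hypothetical helper for illustration; not part of the CosyVoice API.
    Word counts are whitespace-based, so unspaced scripts such as Chinese
    would need a character-based limit instead.
    """
    # Split after Latin sentence punctuation followed by whitespace,
    # or directly after CJK sentence punctuation.
    pieces = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    sentences = [p.strip() for p in pieces if p and p.strip()]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("One two three. Four five. Six seven eight nine.", max_words=5):
        print(chunk)
```

A single sentence longer than `max_words` is kept whole rather than cut mid-sentence, on the assumption that breaking inside a sentence would hurt prosody more than a slightly oversized chunk.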
CosyVoice 2.0 excels at zero-shot voice cloning and cross-lingual speech synthesis. Given a short reference audio (2–5 seconds), it can reproduce the speaker’s timbre and prosody in a different language. Key supported languages: Chinese (Mandarin + 18+ dialects), English, Japanese, Korean, German, Spanish, French, Italian, Russian.
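Before passing a reference clip to the model, it can help to confirm that its length falls in the 2–5 second range mentioned above. This is a small sketch with Python's standard `wave` module; the function names are made up for this example, and it only handles uncompressed PCM WAV files.

```python
import wave


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(path, min_s=2.0, max_s=5.0):
    """Check that a reference clip is within the suggested length range.

    The 2-5 second default comes from the text above; the helper itself
    is a hypothetical pre-check, not part of CosyVoice.
    """
    return min_s <= clip_duration_seconds(path) <= max_s
```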
Concrete use cases:
Compared to CosyVoice 1.0, pronunciation error rates dropped 30–50%, and MOS scores rose from 5.4 to 5.53 (tied with a commercial large-scale TTS system). The model also supports pronunciation inpainting via Chinese Pinyin or English CMU phonemes, which is useful for correcting rare proper nouns.
CosyVoice 2.0 is designed to run on consumer-grade hardware. Because it’s a dense 0.5B model, memory and compute requirements are modest.
| Quantization | VRAM Required | Example GPUs |
|---|---|---|
| FP16 (full) | ~1.2 GB | RTX 3060 12GB, M2 Pro, RTX 4090 |
| Q4_K_M (recommended) | ~600 MB | RTX 2060 6GB, RTX 4070, M1 Mac |
| Q8_0 | ~900 MB | RTX 3060 8GB, M3 Max |
For most users, Q4_K_M quantization offers the best tradeoff: quality loss is imperceptible on casual listening, and VRAM usage drops below 1 GB. You can run this model comfortably on a laptop with 8 GB RAM. CPU inference needs no GPU, though latency will increase.
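As a rough sanity check on figures like these, the memory needed for the weights alone can be estimated from the parameter count and the bits stored per weight. The sketch below is back-of-the-envelope arithmetic, not a measurement: the Q4_K_M bits-per-weight value is an approximate average and should be treated as an assumption. The results are a lower bound; figures quoted for a running model will be higher, presumably because they also cover activations, buffers, and the rest of the synthesis pipeline.

```python
PARAMS = 0.5e9  # parameter count stated in the text

# Approximate bits stored per weight.
# FP16: 16 bits. Q8_0: 8-bit values plus one 16-bit scale per 32 weights = 8.5.
# Q4_K_M: rough average across tensor types (assumption for this sketch).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}


def weight_megabytes(params, bits_per_weight):
    """Memory for the weights alone, in megabytes (10^6 bytes)."""
    return params * bits_per_weight / 8 / 1e6


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_megabytes(PARAMS, bits):.0f} MB for weights alone")
```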
On an RTX 4090 (CUDA), expect 20–30 tokens per second in streaming mode, which translates to sub-100ms audio generation for short utterances. On an M4 Max (Metal), expect 15–25 tokens per second. CPU-only inference on an Apple M2 gets about 5–10 tokens per second—adequate for batch processing but not real-time.
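Whether a given tokens-per-second figure is fast enough for real-time use depends on how many speech tokens represent one second of audio. The sketch below computes a real-time factor from those two numbers. The token rate is passed in explicitly because this page does not state it; the value of 25 used in the example is an assumption and should be replaced with the rate of the build you run. It also considers only token generation, not the later stages that turn tokens into a waveform.

```python
def real_time_factor(tokens_per_second, tokens_per_audio_second):
    """Seconds of compute per second of generated audio.

    Values below 1.0 mean tokens are produced faster than real time.
    Both arguments are caller-supplied; nothing here is measured.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens_per_audio_second / tokens_per_second


if __name__ == "__main__":
    ASSUMED_TOKEN_RATE = 25.0  # assumption: speech tokens per second of audio
    for tps in (5, 10, 20, 30):
        rtf = real_time_factor(tps, ASSUMED_TOKEN_RATE)
        print(f"{tps} tokens/s gives a real-time factor of {rtf:.2f}")
```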
The fastest path to run CosyVoice 2.0 locally is via Ollama:
ollama run cosyvoice2
This downloads the Q4_K_M quantized model and provides a simple API endpoint. Alternatively, you can use the official inference script from the [GitHub repo](https://github.com/FunAudioLLM/CosyVoice) for more control (e.g., adjusting chunk size, streaming mode, or language).
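To check first-packet latency on your own hardware, you can time any generator that yields audio chunks. The sketch below is deliberately independent of the synthesis backend: `measure_stream` works on any iterable, and `fake_stream` is a hypothetical stand-in that you would replace with the streaming call of whichever runtime you use.

```python
import time


def measure_stream(chunks):
    """Consume an iterable of audio chunks and time it.

    Returns (first_chunk_latency_s, total_s, n_chunks); the latency is
    None if the stream yields nothing.
    """
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first, total, count


def fake_stream(n_chunks=3, delay_s=0.01):
    """Hypothetical stand-in for a streaming synthesizer (not a real API)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)  # simulate generation time per chunk
        yield b"\x00" * 320  # placeholder bytes standing in for audio


if __name__ == "__main__":
    first, total, count = measure_stream(fake_stream())
    print(f"first chunk after {first * 1000:.1f} ms, {count} chunks in {total * 1000:.1f} ms")
```

Measuring the first chunk separately from the total matters for streaming TTS, since perceived responsiveness depends on when playback can start rather than when synthesis finishes.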
Hardware requirements for best results: RTX 4090, RTX 4070 Ti Super, or any GPU with at least 8 GB VRAM. For Mac users, M2 Pro or M4 Max with 16 GB unified memory will run the quantized model with good latency.
vs. Bark (by Suno AI) – Bark is a 0.5–1B parameter TTS model that can also do non-speech sounds and emotional tones. However, Bark is non-streaming, has higher latency (2–5 seconds for short text), and does not support cross-lingual zero-shot cloning natively. CosyVoice 2.0 wins on latency and multilingual consistency.
vs. XTTS-v2 (by Coqui) – XTTS-v2 is also a small TTS model (around 1.1B parameters) that supports voice cloning. Its English quality is strong, but multilingual performance degrades, especially for Asian languages. CosyVoice 2.0 provides better Chinese dialect support and lower latency (150ms vs. 500ms+ for first audio packet).
When to choose CosyVoice 2.0: You need streaming, low-latency TTS with reliable cross-lingual zero-shot cloning, especially for Chinese or mixed-language content. When to avoid: If you need non-speech sounds or want a model that handles very long text (over 5 minutes) without streaming logic—Bark may produce more natural long-form prosody despite its latency.
For a 0.5B model, CosyVoice 2.0 offers an unmatched combination of speed, small footprint, and voice fidelity. It’s a practical choice for developers who want to deploy local TTS without expensive hardware or cloud dependencies.
