Alibaba

CosyVoice 2.0

A 0.5B-parameter LLM-based streaming multilingual zero-shot TTS system by Alibaba's FunAudioLLM group.

0.5B params · Dense


Our Take

Best for: Open-source TTS workloads

A solid 0.5B-parameter dense audio model from Alibaba. Treat the modality benchmarks on this page as the leading indicator of fit, since composite scoring across modalities is still maturing.

Generated from this model’s benchmarks and ranking signals. Editor reviews refine it over time.

Model Specifications

Parameters: 0.5B

Architecture: Dense

Provider: Alibaba

Download Size: 4.9 GB

Community

Monthly Downloads: 3.8K

Likes: 75

Last Updated: 26 days ago

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.


License

Apache 2.0

Performance & Scoring

Benchmarks

Metric	Value
WER	9.0%
MOS	5.0 / 5

MBA Open Score

68.8 · BB

Component	Weight	Score
Benchmark	40%	91.0
Popularity	25%	27.4
Efficiency	25%	76.1
Versatility	10%	65.0
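
The four component weights and scores listed for the MBA Open Score appear to combine into the headline figure as a plain weighted sum. The sketch below checks that arithmetic. The weighted-sum formula and the `composite_score` name are assumptions made for illustration, not a documented scoring method.

```python
# Assumption: the composite is a simple weighted sum of the four components.
# Each entry is (weight, score) as listed on the page.
COMPONENTS = {
    "Benchmark": (0.40, 91.0),
    "Popularity": (0.25, 27.4),
    "Efficiency": (0.25, 76.1),
    "Versatility": (0.10, 65.0),
}


def composite_score(components):
    """Return the weighted sum of (weight, score) pairs."""
    return sum(weight * score for weight, score in components.values())


score = composite_score(COMPONENTS)
print(round(score, 1))
```

Under this assumption the result rounds to 68.8, which matches the listed score, so the low Popularity component is what pulls the composite well below the 91.0 Benchmark component.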

Hardware Compatibility

See which devices can run this model and at what quality level.


102 devices


Device	Vendor	Tier	Memory
ACEMAGIC M1A Pro (i9-13900HK + ARC A770)	ACEMAGIC	SS	0.8 GB
Acer Veriton GN100 AI Mini	Acer	SS	0.8 GB
AMD Instinct MI300X	AMD	SS	0.8 GB
AMD Instinct MI325X	AMD	SS	0.8 GB
AMD Instinct MI355X	AMD	SS	0.8 GB
AMD Radeon RX 7600 8GB	AMD	SS	0.8 GB
AMD Radeon RX 7700 XT	AMD	SS	0.8 GB
AMD Radeon RX 7800 XT	AMD	SS	0.8 GB
AMD Radeon RX 7900 XT	AMD	SS	0.8 GB
AMD Radeon RX 7900 XTX	AMD	SS	0.8 GB
AMD Radeon RX 9070	AMD	SS	0.8 GB
AMD Radeon RX 9070 XT	AMD	SS	0.8 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)	Apple	SS	0.8 GB
Apple M4	Apple	SS	0.8 GB
Apple M4 Max (40-core GPU)	Apple	SS	0.8 GB
Apple M4 Pro (14-core CPU, 20-core GPU)	Apple	SS	0.8 GB
Apple M5	Apple	SS	0.8 GB
Apple M5 Max (18-core CPU, 40-core GPU)	Apple	SS	0.8 GB
Apple M5 Pro (18-core CPU, 20-core GPU)	Apple	SS	0.8 GB
Apple Mac Mini (M1, 2020)	Apple	SS	0.8 GB
Apple Mac Mini (M2, 2023)	Apple	SS	0.8 GB
Apple Mac Mini (M2 Pro, 2023)	Apple	SS	0.8 GB
Apple Mac Mini (M4, 2024)	Apple	SS	0.8 GB
Apple Mac Mini (M4 Pro, 2024)	Apple	SS	0.8 GB
Apple Mac Studio (M1 Max, 2022)	Apple	SS	0.8 GB


Rent in the Cloud

Cheapest current cloud rentals with at least 1 GB VRAM, refreshed hourly.

GPU	Provider	Tier	VRAM	Cost / GPU-hour
NVIDIA GeForce RTX 5070 Ti	Vast.ai	Spot	16 GB	$0.11
NVIDIA GeForce RTX 3070	RunPod	Community	8 GB	$0.13
NVIDIA GeForce RTX 3070	RunPod	Spot	8 GB	$0.13
NVIDIA GeForce RTX 5090	Vast.ai	Spot	32 GB	$0.13
NVIDIA GeForce RTX 4090	Vast.ai	Spot	24 GB	$0.13

Per-GPU rate across RunPod and the Vast.ai marketplace.

Spot tier is interruptible. Plan for restarts when comparing against on-demand prices.


About This Model

Overview

CosyVoice 2.0 is a streaming, multilingual, zero-shot text-to-speech system developed by Alibaba’s FunAudioLLM group. At 0.5B parameters, it is a small-footprint TTS model that aims to keep quality high without adding latency. Unlike larger general-purpose LLMs that produce speech through multimodal extensions, CosyVoice 2.0 is purpose-built for synthesis, with an architecture optimized for first-packet latency as low as 150 ms and naturalness comparable to human speech.

It’s not a chatbot or a general language model; it’s a speech synthesis engine. The model accepts text and a short voice sample (the zero-shot prompt) and outputs speech in nine languages, with fine-grained control over emotion, dialect, and speaking style. The Apache 2.0 license permits deployment in commercial products.

Architecture & Technical Details

CosyVoice 2.0 uses a dense 0.5B-parameter architecture, meaning all parameters are active during inference. There is no mixture-of-experts (MoE) routing, so memory usage and inference speed are predictable. The model comprises three main components:

Finite Scalar Quantization (FSQ) replaces the traditional VQ codebook used in CosyVoice 1.0. FSQ improves codebook utilization and reduces token collapse, leading to higher quality audio with fewer artifacts.
Text-Speech Language Model is a simplified backbone that directly leverages a pre-trained LLM. This design enables the model to process text and generate speech tokens in a single autoregressive pass, without needing separate alignment or duration predictors.
Chunk-Aware Causal Flow Matching is the decoder stage that converts discrete speech tokens into Mel spectrograms, which a vocoder then renders as waveforms. It supports both streaming and non-streaming modes within one model. In streaming mode, the model processes audio in chunks with minimal quality loss; in non-streaming mode, it synthesizes the whole utterance at once.
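
To make the FSQ idea from the component list concrete, here is a minimal sketch of finite scalar quantization: each dimension of a low-dimensional vector is bounded and rounded to a small set of integer levels, and the per-dimension digits are combined into a single token id. This is not CosyVoice's implementation. The function names, the three-dimensional input, and `k=1` are illustrative assumptions, and the learned projection that would produce the input vector is omitted.

```python
import math


def fsq_quantize(vector, k=1):
    """Bound each value to (-k, k) with tanh, then round to the nearest integer."""
    return [round(k * math.tanh(v)) for v in vector]


def fsq_token_id(digits, k=1):
    """Combine per-dimension digits in [-k, k] into one id, base (2k + 1)."""
    base = 2 * k + 1
    return sum((d + k) * base**j for j, d in enumerate(digits))


# Illustrative input: a 3-dimensional vector with 3 levels per dimension.
digits = fsq_quantize([3.0, -3.0, 0.1], k=1)
token_id = fsq_token_id(digits, k=1)
print(digits, token_id)
```

Because every combination of levels is a valid code, the implicit codebook has (2k + 1)^D entries by construction (27 in this toy setup), which is the intuition behind the improved codebook utilization described above.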

Context length is not specified by the provider, but in practice the model handles sentences up to several dozen words comfortably. For longer texts, you can feed it incrementally thanks to its streaming support.
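
One way to feed longer texts incrementally is to split them into sentence-aligned chunks before synthesis. The helper below is a sketch under stated assumptions: `chunk_text` and the 40-word budget are hypothetical choices for illustration, the splitter only handles space-separated text with `.`, `!`, or `?` sentence endings, and the step that passes each chunk to the model is left out.

```python
import re


def chunk_text(text, max_words=40):
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


for chunk in chunk_text("One two three. Four five six. Seven eight.", max_words=6):
    print(chunk)
```

Splitting on sentence boundaries rather than at a fixed word count keeps each chunk a natural prosodic unit, which tends to matter more for speech than for text processing.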

Capabilities & Use Cases

CosyVoice 2.0 excels at zero-shot voice cloning and cross-lingual speech synthesis. Given a short reference audio (2–5 seconds), it can reproduce the speaker’s timbre and prosody in a different language. Key supported languages: Chinese (Mandarin + 18+ dialects), English, Japanese, Korean, German, Spanish, French, Italian, Russian.

Concrete use cases:

Real-time voice assistants – first-packet latency as low as 150 ms makes it suitable for conversational AI that needs to respond quickly with a consistent voice.
Content localization – dub videos or podcasts in multiple languages while preserving the original speaker’s voice.
Audiobook narration – generate natural-sounding narration with emotional inflection control.
Accessibility tools – provide natural TTS for screen readers in multiple languages without per-language voice engineering.

Compared to CosyVoice 1.0, pronunciation error rates dropped 30–50%, and the provider-reported MOS rose from 5.4 to 5.53 (tied with a commercial large-scale TTS system). The model also supports pronunciation inpainting via Chinese Pinyin or English CMU phonemes, which is useful for correcting rare proper nouns.

Running CosyVoice 2.0 Locally

CosyVoice 2.0 is designed to run on consumer-grade hardware. Because it’s a dense 0.5B model, memory and compute requirements are modest.

Minimum and Recommended Hardware

Quantization	VRAM Required	Example GPUs
FP16 (full)	~1.2 GB	RTX 3060 12GB, M2 Pro, RTX 4090
Q4_K_M (recommended)	~600 MB	RTX 2060 6GB, RTX 4070, M1 Mac
Q8_0	~900 MB	RTX 3060 8GB, M3 Max

For most users, Q4_K_M quantization offers the best tradeoff: quality loss is imperceptible on casual listening, and VRAM usage drops well below 1 GB. The model also runs on a laptop with 8 GB RAM using CPU-only inference, with no GPU required, though latency will increase.
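
As a rough cross-check on the VRAM figures in the table, the memory taken by the weights alone can be estimated from the parameter count and the bits stored per weight. This is a back-of-the-envelope sketch: `weight_memory_gb` is a hypothetical helper, the bit widths are nominal values, and activations, caches, and runtime overhead are deliberately ignored.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Estimate weight-only memory in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 0.5e9  # 0.5B parameters, as stated for this model

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(PARAMS, bits):.2f} GB")
```

At 16 bits per weight this gives 1.0 GB, so the ~1.2 GB listed for FP16 is consistent with weights plus a modest runtime overhead. The quantized rows in the table sit further above the nominal weight-only estimates, which suggests they include overhead that does not shrink with quantization.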

Expected Performance

On an RTX 4090 (CUDA), expect 20–30 tokens per second in streaming mode, which translates to sub-100 ms audio generation for short utterances. On an M4 Max (Metal), expect 15–25 tokens per second. CPU-only inference on an Apple M2 reaches about 5–10 tokens per second, which is adequate for batch processing but not for real-time use.
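
Whether a given tokens-per-second figure is fast enough for live playback depends on how many speech tokens represent one second of audio. The sketch below computes a real-time factor under an assumed rate of 25 speech tokens per second of audio. That rate and the `real_time_factor` helper are assumptions for illustration and should be checked against the model's configuration.

```python
def real_time_factor(tokens_per_second, token_rate_hz=25.0):
    """Seconds of compute needed per second of audio (below 1.0 is faster than playback)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return token_rate_hz / tokens_per_second


for speed in (10, 20, 30):
    print(f"{speed} tokens/s -> RTF {real_time_factor(speed):.2f}")
```

Under the assumed 25 tokens per second of audio, generation at 20 to 30 tokens per second gives a real-time factor between 1.25 and roughly 0.83, so the throughput figures quoted above sit close to the real-time boundary rather than far below it.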

Quick Start with Ollama

The fastest path to run CosyVoice 2.0 locally is via Ollama:

ollama run cosyvoice2

This downloads the Q4_K_M quantized model and provides a simple API endpoint. Alternatively, you can use the official inference script from the [GitHub repo](https://github.com/FunAudioLLM/CosyVoice) for more control (e.g., adjusting chunk size, streaming mode, or language).

Hardware requirements for best results: RTX 4090, RTX 4070 Ti Super, or any GPU with at least 8 GB VRAM. For Mac users, M2 Pro or M4 Max with 16 GB unified memory will run the quantized model with good latency.

How It Compares

vs. Bark (by Suno AI) – Bark is a 0.5–1B parameter TTS model that can also do non-speech sounds and emotional tones. However, Bark is non-streaming, has higher latency (2–5 seconds for short text), and does not support cross-lingual zero-shot cloning natively. CosyVoice 2.0 wins on latency and multilingual consistency.

vs. XTTS-v2 (by Coqui) – XTTS-v2 is also a small TTS model (around 1.1B parameters) that supports voice cloning. Its English quality is strong, but multilingual performance degrades, especially for Asian languages. CosyVoice 2.0 provides better Chinese dialect support and lower latency (150 ms vs. 500 ms+ for the first audio packet).

When to choose CosyVoice 2.0: you need streaming, low-latency TTS with reliable cross-lingual zero-shot cloning, especially for Chinese or mixed-language content. When to avoid it: you need non-speech sounds, or you want a model that handles very long text (over 5 minutes) without streaming logic. In that case, Bark may produce more natural long-form prosody despite its latency.

For a 0.5B model, CosyVoice 2.0 offers an unmatched combination of speed, small footprint, and voice fidelity. It’s a practical choice for developers who want to deploy local TTS without expensive hardware or cloud dependencies.

Related Models

Qwen3-ASR-1.7B (Alibaba · 1.7B · Dense)

Qwen3-ASR-0.6B (Alibaba · 0.6B · Dense)

