Alibaba FunAudioLLM

CosyVoice 2.0

A 0.5B-parameter LLM-based streaming multilingual zero-shot TTS system by Alibaba's FunAudioLLM group.

0.5B paramsDense

View on Hugging Face Source Code Official Page

Model Specifications

Parameters0.5B

Quick Start

Download from Hugging Face

Access model weights, configuration files, and documentation.

Download from Hugging Face

License

Apache 2.0View Full License

Performance & Scoring

Benchmarks

No benchmark data available for this model yet.

Overall Score

41.7CC

Benchmark40%

50.0

Popularity25%

25.3

Efficiency25%

35.6

Versatility10%

65.0

Hardware Compatibility

See which devices can run this model and at what quality level.

Hide F tierOnly featured devices

85 devices


Acer Veriton GN100 AI MiniAcer	SS	0.8 GB
AMD Instinct MI300XAMD	SS	0.8 GB
AMD Instinct MI325XAMD	SS	0.8 GB
AMD Instinct MI355XAMD	SS	0.8 GB
AMD Radeon RX 7600 8GBAMD	SS	0.8 GB
AMD Radeon RX 7700 XTAMD	SS	0.8 GB
AMD Radeon RX 7800 XTAMD	SS	0.8 GB
AMD Radeon RX 7900 XTAMD	SS	0.8 GB
AMD Radeon RX 7900 XTXAMD	SS	0.8 GB
AMD Radeon RX 9070AMD	SS	0.8 GB
AMD Radeon RX 9070 XTAMD	SS	0.8 GB
Apple M3 Ultra (32-core CPU, 80-core GPU)Apple	SS	0.8 GB
Apple M4Apple	SS	0.8 GB
Apple M4 Max (40-core GPU)Apple	SS	0.8 GB
Apple M4 Pro (14-core CPU, 20-core GPU)Apple	SS	0.8 GB
Apple M5Apple	SS	0.8 GB
Apple M5 Max (18-core CPU, 40-core GPU)Apple	SS	0.8 GB
Apple M5 Pro (18-core CPU, 20-core GPU)Apple	SS	0.8 GB
Apple Mac Mini (M1, 2020)Apple	SS	0.8 GB
Apple Mac Mini (M2, 2023)Apple	SS	0.8 GB
Apple Mac Mini (M2 Pro, 2023)Apple	SS	0.8 GB
Apple Mac Mini (M4, 2024)Apple	SS	0.8 GB
Apple Mac Mini (M4 Pro, 2024)Apple	SS	0.8 GB
Apple Mac Studio (M1 Max, 2022)Apple	SS	0.8 GB
Apple Mac Studio (M1 Ultra, 2022)Apple	SS	0.8 GB

Rows per page

Page 1 of 4

Find the best hardware for this model

Use our hardware calculator to find the optimal device for running this model.

About This Model

CosyVoice 2.0 is a 0.5B-parameter large language model (LLM) purpose-built for streaming multilingual speech synthesis. Developed by Alibaba’s FunAudioLLM group, it represents a significant architectural shift from traditional TTS systems by utilizing a dense LLM backbone to handle text-to-speech tasks. This model is designed specifically for developers building real-time interactive agents where low latency and high emotional fidelity are non-negotiable.

Unlike many TTS models that rely on heavy external frontend modules for text normalization, CosyVoice 2.0 handles complex text formats, special symbols, and numbers natively through its LLM architecture. It occupies a unique space in the local AI ecosystem, competing with models like GPT-SoVITS or Bark, but with a specific focus on "bi-streaming"—the ability to process text-in and audio-out simultaneously with minimal delay.

Architecture & Technical Details

The core of CosyVoice 2.0 is a 0.5B dense parameter architecture that streamlines the text-to-speech pipeline. It moves away from the complex, multi-stage decoding of version 1.0 in favor of a more unified approach. The system utilizes Finite-Scalar Quantization (FSQ) to improve the utilization of its speech token codebook, which directly impacts the richness and stability of the generated voice.

The model employs two primary generative components: a text-speech LLM for semantic decoding and a chunk-aware causal Flow Matching model for acoustic synthesis. This combination allows the model to function in both streaming and non-streaming modes within a single weight file. By using a pre-trained LLM as the backbone, the model inherits sophisticated linguistic understanding, which translates to a 30% to 50% reduction in pronunciation errors compared to previous iterations.

For practitioners running CosyVoice 2.0 locally, the 0.5B parameter count is highly efficient. Because it is a dense model rather than a Mixture of Experts (MoE), the VRAM footprint is predictable and the entire model stays active during inference, ensuring consistent performance across different hardware configurations.

Capabilities & Use Cases

CosyVoice 2.0 excels in zero-shot voice cloning and cross-lingual synthesis. It is particularly effective for:

Low-Latency Conversational AI: With a first-packet latency as low as 150ms, it is ideal for local LLM-based assistants where the "thinking" time of the brain model is already a bottleneck.
Multilingual Content Creation: It supports 9 core languages—including Chinese, English, Japanese, Korean, German, Spanish, French, Italian, and Russian—along with over 18 Chinese dialects.
Emotionally Expressive TTS: The model supports "Instruct" capabilities, allowing users to influence the output through tags for emotion, speed, volume, and specific dialect accents.
Pronunciation Inpainting: Developers can use Chinese Pinyin or English CMU phonemes to manually correct or "paint" specific pronunciations, a feature critical for production environments where brand names or technical jargon must be precise.

Running CosyVoice 2.0 Locally

Running CosyVoice 2.0 locally is remarkably accessible due to its small 0.5B footprint. While it is an LLM-based system, its specialized task means it does not require the massive VRAM overhead of 70B general-purpose models.

Hardware Requirements & VRAM

To run CosyVoice 2.0 with optimal performance, consider the following hardware targets:

Minimum VRAM: 4GB. This allows for the model and the necessary flow-matching tensors to reside in memory. An NVIDIA RTX 3060 (12GB) or even an older RTX 2060 will handle this comfortably.
Recommended Hardware: 8GB+ VRAM. An NVIDIA RTX 4060 Ti or an Apple M-series chip (M2/M3/M4) with 16GB of Unified Memory provides enough headroom for the TTS model to run alongside a small "brain" LLM (like Llama 3.2 3B or Phi-3.5).
Performance: On consumer-grade GPUs like the RTX 4090, the real-time factor (RTF) is exceptionally low, often generating audio 10-20x faster than real-time playback.

Quantization & Deployment

While FP16 is the standard for maximum fidelity, Q4_K_M quantization via tools like GGUF or through vLLM integration is recommended for users looking to maximize throughput on mid-range hardware.

The most efficient way to get started is via the official FunAudioLLM repository, which provides Gradio-based web UIs and deployment scripts. For those integrating into existing pipelines, CosyVoice 2.0 has added vLLM support, allowing it to be served via an OpenAI-compatible API, making it a drop-in replacement for cloud-based TTS services.

How It Compares

CosyVoice 2.0 sits in a competitive bracket of small-scale, high-performance audio models.

CosyVoice 2.0 vs. GPT-SoVITS: GPT-SoVITS is often favored for its community-driven fine-tuning tools and "few-shot" capabilities. However, CosyVoice 2.0 generally offers superior streaming performance and lower first-packet latency, making it better suited for interactive agents than batch audio generation.
CosyVoice 2.0 vs. Bark (Sunno): Bark is known for its creative, almost unpredictable "hallucinated" sounds (laughter, sighs), but it is notoriously difficult to control and slow to run. CosyVoice 2.0 is a much more "stable" production tool with better text normalization and predictable emotional control.
CosyVoice 2.0 vs. CosyVoice 1.0: The 2.0 version is a significant upgrade in stability. The 1.0 version frequently struggled with cross-language timbre consistency; 2.0 fixes this, ensuring that a cloned voice sounds the same whether it is speaking English, Japanese, or German.

For developers who need a local, Apache 2.0-licensed TTS that can keep up with the speed of a human conversation, CosyVoice 2.0 is currently the most robust 0.5B option available.

CosyVoice 2.0

Model Specifications

Quick Start

Download from Hugging Face

License

Performance & Scoring

Benchmarks

Overall Score

Hardware Compatibility

Find the best hardware for this model

Community

About This Model

Architecture & Technical Details

Capabilities & Use Cases

Running CosyVoice 2.0 Locally

Hardware Requirements & VRAM

Quantization & Deployment

How It Compares