Fish Audio's multilingual open-source TTS model using a Dual-AR LLM-based architecture, trained on over 1M hours of audio across 13 languages.
Fish Speech v1.5 is a state-of-the-art, multilingual text-to-speech (TTS) model developed by Fish Audio. Built on a unique Dual-Autoregressive (Dual-AR) architecture, it shifts away from traditional diffusion-based or GAN-based TTS methods in favor of a Large Language Model (LLM) approach to speech synthesis. By treating audio as a sequence of discrete tokens, Fish Speech v1.5 achieves a level of prosody, emotional inflection, and linguistic fluidity that positions it as a primary local alternative to proprietary services like ElevenLabs.
The model is trained on a massive dataset exceeding 1 million hours of audio across 13 languages, with heavy weighting toward English and Chinese (over 300k hours each). For developers and engineers running Fish Speech v1.5 locally, the model offers a "zero-shot" voice cloning capability: providing a reference audio clip as short as 10 seconds allows the model to replicate the speaker's timbre and rhythm with high fidelity.
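To illustrate how little reference material is needed, the sketch below trims a longer recording down to a ten-second clip using the soundfile library; the file names are placeholders for your own recordings.

```python
import soundfile as sf

# Load a longer recording and trim it to ~10 seconds for use as a
# zero-shot voice-cloning reference. File names are placeholders.
audio, sample_rate = sf.read("speaker_recording.wav")

ten_seconds = 10 * sample_rate
reference = audio[:ten_seconds]

# Save the trimmed clip; this is the file the cloning pipeline
# consumes as the reference voice.
sf.write("reference_10s.wav", reference, sample_rate)
```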
The core of Fish Speech v1.5 is its Dual-AR LLM-based architecture. Unlike models that predict mel-spectrograms in a single pass, Fish Speech processes speech through two distinct stages:

1. Token generation: an autoregressive LLM converts the input text (and any reference-speaker tokens) into a sequence of discrete acoustic tokens.
2. Waveform decoding: a VQGAN decoder reconstructs the audio waveform from those tokens.
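In code terms, the two stages compose roughly as in the sketch below. The functions encode_reference, generate_acoustic_tokens, and decode_waveform are hypothetical names used only to show the data flow; the actual entry points live in the Fish Speech repository and differ in signature.

```python
import torch

# Schematic sketch of the two-stage pipeline. The three functions
# called here are hypothetical placeholders illustrating the data
# flow, not the real Fish Speech API.

def synthesize(text: str, reference_wav: str) -> torch.Tensor:
    # Optional: encode a short reference clip into prompt tokens
    # for zero-shot voice cloning.
    prompt_tokens = encode_reference(reference_wav)      # hypothetical

    # Stage 1: the autoregressive LLM maps text (plus prompt tokens)
    # to a sequence of discrete acoustic tokens.
    acoustic_tokens = generate_acoustic_tokens(          # hypothetical
        text=text,
        prompt_tokens=prompt_tokens,
    )

    # Stage 2: the VQGAN decoder reconstructs a waveform from tokens.
    return decode_waveform(acoustic_tokens)              # hypothetical
```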
This architecture is "dense," meaning every parameter is active during inference. While Fish Audio has not disclosed the exact parameter count for the v1.5 weights, the architectural lineage and performance benchmarks suggest a footprint that fits comfortably within modern consumer GPU memory limits. The model's primary advantage is its "emotion control" via bracketed tags—such as [laughing], [whispering], or [angry]—which are processed natively by the LLM as part of the input sequence.
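Because the tags are ordinary tokens in the input sequence, using them requires nothing more than string construction. A minimal illustration follows; the exact tag vocabulary supported by a given checkpoint may vary.

```python
# Emotion tags are embedded directly in the input text; the LLM
# consumes them as part of the token sequence. Check your
# checkpoint's documentation for the supported tag vocabulary.
script = (
    "[whispering] Don't wake the baby. "
    "[excited] She said yes! "
    "[laughing] I can't believe that actually worked."
)
```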
Fish Speech v1.5 is designed for high-throughput, expressive audio generation. It excels in environments where emotional nuance is as important as verbal clarity.
By inserting tags like [soft] or [excited] into the text string, users can manipulate the output without post-processing.

To run Fish Speech v1.5 locally, you must account for both the LLM engine and the VQGAN decoder. While the model is highly efficient, audio synthesis is computationally intensive compared to text generation.
For most practitioners, running the model in FP16 or BF16 is preferred to maintain the nuances of the voice clones. Unlike text LLMs where Q4_K_M quantization has negligible impact on logic, heavy quantization in TTS models can occasionally introduce metallic artifacts or "robotic" jitter in the audio.
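For those loading the weights manually with PyTorch, keeping them in BF16 amounts to a single cast. A minimal sketch, assuming a standard PyTorch checkpoint; FishSpeechModel and the checkpoint path are hypothetical stand-ins for the actual class and file layout.

```python
import torch

# Minimal sketch of loading weights at half precision. FishSpeechModel
# is a hypothetical stand-in for the actual model class; the checkpoint
# path is a placeholder.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = FishSpeechModel()                                # hypothetical
state = torch.load("checkpoints/fish-speech-1.5.pth", map_location=device)
model.load_state_dict(state)

# BF16 keeps the dynamic range of FP32 while halving memory; prefer it
# over aggressive integer quantization to avoid audible artifacts.
model = model.to(device=device, dtype=torch.bfloat16).eval()
```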
The most direct way to get started is via the official Fish Speech GitHub repository, which includes a Gradio-based WebUI and a CLI for batch processing. For those looking to integrate Fish Speech into a broader agentic workflow, the model can be served via an OpenAI-compatible API wrapper, allowing it to act as the "mouth" for local LLM deployments.
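As a sketch of the API-wrapper route, the standard openai Python client can be pointed at a local server. The base URL, model name, and voice identifier below are deployment-specific placeholders, not fixed values.

```python
from openai import OpenAI

# Point the OpenAI client at a local OpenAI-compatible wrapper
# serving Fish Speech. All identifiers below are placeholders
# determined by your deployment, not fixed values.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.audio.speech.create(
    model="fish-speech-1.5",      # whatever name the wrapper registers
    voice="cloned-reference",     # deployment-specific voice identifier
    input="Hello from a fully local text-to-speech stack.",
)

# The response body is the raw audio payload.
with open("output.wav", "wb") as f:
    f.write(response.content)
```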
Fish Speech v1.5 occupies a unique space between lightweight "edge" TTS and massive proprietary models.
For developers building local-first applications, Fish Speech v1.5 stands among the strongest open-weight options for high-fidelity, emotionally controllable speech synthesis. Its CC-BY-NC-SA-4.0 license allows for extensive personal and research use, though commercial applications require a separate agreement with Fish Audio.