Fish Audio's multilingual open-source TTS model using a Dual-AR LLM-based architecture, trained on over 1M hours of audio across 13 languages.
Fish Speech v1.5 is a state-of-the-art, multilingual text-to-speech (TTS) model developed by Fish Audio. Built on a unique Dual-Autoregressive (Dual-AR) architecture, it shifts away from traditional diffusion-based or GAN-based TTS methods in favor of a Large Language Model (LLM) approach to speech synthesis. By treating audio as a sequence of discrete tokens, Fish Speech v1.5 achieves a level of prosody, emotional inflection, and linguistic fluidity that positions it as a primary local alternative to proprietary services like ElevenLabs.
The model is trained on a massive dataset exceeding 1 million hours of audio across 13 languages, with heavy weighting toward English and Chinese (over 300k hours each). For developers and engineers running Fish Speech v1.5 locally, the model offers a "zero-shot" voice cloning capability: providing a reference audio clip as short as 10 seconds allows the model to replicate the speaker's timbre and rhythm with high fidelity.
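To illustrate how little reference material is needed, the sketch below trims a longer recording down to a ten-second clip using the soundfile library; the file names are placeholders for your own recordings.

```python
import soundfile as sf

# Load a longer recording and trim it to ~10 seconds for use as a
# zero-shot voice-cloning reference. File names are placeholders.
audio, sample_rate = sf.read("speaker_recording.wav")

ten_seconds = 10 * sample_rate
reference = audio[:ten_seconds]

# Save the trimmed clip; this is the file the cloning pipeline
# consumes as the reference voice.
sf.write("reference_10s.wav", reference, sample_rate)
```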
The core of Fish Speech v1.5 is its Dual-AR LLM-based architecture. Unlike models that predict mel-spectrograms in a single pass, Fish Speech processes speech through two distinct stages:

1. Token generation: an autoregressive LLM converts the input text (and any reference-speaker tokens) into a sequence of discrete acoustic tokens.
2. Waveform decoding: a VQGAN decoder reconstructs the audio waveform from those tokens.
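In code terms, the two stages compose roughly as in the sketch below. The functions encode_reference, generate_acoustic_tokens, and decode_waveform are hypothetical names used only to show the data flow; the actual entry points live in the Fish Speech repository and differ in signature.

```python
import torch

# Schematic sketch of the two-stage pipeline. The three functions
# called here are hypothetical placeholders illustrating the data
# flow, not the real Fish Speech API.

def synthesize(text: str, reference_wav: str) -> torch.Tensor:
    # Optional: encode a short reference clip into prompt tokens
    # for zero-shot voice cloning.
    prompt_tokens = encode_reference(reference_wav)      # hypothetical

    # Stage 1: the autoregressive LLM maps text (plus prompt tokens)
    # to a sequence of discrete acoustic tokens.
    acoustic_tokens = generate_acoustic_tokens(          # hypothetical
        text=text,
        prompt_tokens=prompt_tokens,
    )

    # Stage 2: the VQGAN decoder reconstructs a waveform from tokens.
    return decode_waveform(acoustic_tokens)              # hypothetical
```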
This architecture is "dense," meaning every parameter is active during inference. While Fish Audio has not disclosed the exact parameter count for the v1.5 weights, the architectural lineage and performance benchmarks suggest a footprint that fits comfortably within modern consumer GPU memory limits. The model's primary advantage is its "emotion control" via bracketed tags—such as [laughing], [whispering], or [angry]—which are processed natively by the LLM as part of the input sequence.
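Because the tags are ordinary tokens in the input sequence, using them requires nothing more than string construction. A minimal illustration follows; the exact tag vocabulary supported by a given checkpoint may vary.

```python
# Emotion tags are embedded directly in the input text; the LLM
# consumes them as part of the token sequence. Check your
# checkpoint's documentation for the supported tag vocabulary.
script = (
    "[whispering] Don't wake the baby. "
    "[excited] She said yes! "
    "[laughing] I can't believe that actually worked."
)
```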
Fish Speech v1.5 is designed for high-throughput, expressive audio generation. It excels in environments where emotional nuance is as important as verbal clarity.
By inserting tags like [soft] or [excited] into the text string, users can manipulate the output without post-processing.

To run Fish Speech v1.5 locally, you must account for both the LLM engine and the VQGAN decoder. While the model is highly efficient, audio synthesis is computationally intensive compared to text generation.
For most practitioners, running the model in FP16 or BF16 is preferred to maintain the nuances of the voice clones. Unlike text LLMs where Q4_K_M quantization has negligible impact on logic, heavy quantization in TTS models can occasionally introduce metallic artifacts or "robotic" jitter in the audio.
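For those loading the weights manually with PyTorch, keeping them in BF16 amounts to a single cast. A minimal sketch, assuming a standard PyTorch checkpoint; FishSpeechModel and the checkpoint path are hypothetical stand-ins for the actual class and file layout.

```python
import torch

# Minimal sketch of loading weights at half precision. FishSpeechModel
# is a hypothetical stand-in for the actual model class; the checkpoint
# path is a placeholder.
device = "cuda" if torch.cuda.is_available() else "cpu"

model = FishSpeechModel()                                # hypothetical
state = torch.load("checkpoints/fish-speech-1.5.pth", map_location=device)
model.load_state_dict(state)

# BF16 keeps the dynamic range of FP32 while halving memory; prefer it
# over aggressive integer quantization to avoid audible artifacts.
model = model.to(device=device, dtype=torch.bfloat16).eval()
```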
The most direct way to get started is via the official Fish Speech GitHub repository, which includes a Gradio-based WebUI and a CLI for batch processing. For those looking to integrate Fish Speech into a broader agentic workflow, the model can be served via an OpenAI-compatible API wrapper, allowing it to act as the "mouth" for local LLM deployments.
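As a sketch of the API-wrapper route, the standard openai Python client can be pointed at a local server. The base URL, model name, and voice identifier below are deployment-specific placeholders, not fixed values.

```python
from openai import OpenAI

# Point the OpenAI client at a local OpenAI-compatible wrapper
# serving Fish Speech. All identifiers below are placeholders
# determined by your deployment, not fixed values.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.audio.speech.create(
    model="fish-speech-1.5",      # whatever name the wrapper registers
    voice="cloned-reference",     # deployment-specific voice identifier
    input="Hello from a fully local text-to-speech stack.",
)

# The response body is the raw audio payload.
with open("output.wav", "wb") as f:
    f.write(response.content)
```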
Fish Speech v1.5 occupies a unique space between lightweight "edge" TTS and massive proprietary models.
For developers building local-first applications, Fish Speech v1.5 stands among the strongest open-weight options for high-fidelity, emotionally controllable speech synthesis. Its CC-BY-NC-SA-4.0 license allows for extensive personal and research use, though commercial applications require a separate agreement with Fish Audio.