Multilingual open-weights TTS model by Fish Audio using an LLM-based dual-AR architecture over VQ audio tokens, trained on 700k hours across 8 languages.
Fish Speech v1.4 is a state-of-the-art, open-weights text-to-speech (TTS) system developed by Fish Audio. Unlike traditional TTS engines that rely on concatenative synthesis or simple neural vocoders, Fish Speech v1.4 adopts a Large Language Model (LLM) approach. It treats audio as a sequence of discrete VQ (Vector Quantized) tokens, allowing it to leverage the same transformer-based architectures that power modern LLMs.
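To make the "audio as tokens" idea concrete, here is a minimal, self-contained sketch — not Fish Audio's code; the codebook size and model dimensions are placeholders — showing that once speech is VQ-encoded, generation reduces to ordinary next-token prediction:

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 1024  # assumed codebook size; the real value is model-specific

class TinyAudioLM(nn.Module):
    """Minimal causal transformer over discrete audio codes."""
    def __init__(self, vocab: int = CODEBOOK_SIZE, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:  # (batch, seq)
        mask = nn.Transformer.generate_square_subsequent_mask(codes.size(1))
        h = self.blocks(self.embed(codes), mask=mask)
        return self.head(h)  # logits over the codebook, per position

# Once audio is integers, generation is ordinary next-token sampling.
codes = torch.randint(0, CODEBOOK_SIZE, (1, 32))
next_code = TinyAudioLM()(codes)[:, -1].argmax(dim=-1)
```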
This model is specifically designed for high-fidelity voice cloning and expressive speech generation across eight major languages. By training on a massive 700,000-hour dataset—a significant jump from the 200,000 hours used in previous iterations—Fish Audio has positioned v1.4 as a primary local alternative to proprietary APIs like ElevenLabs. For developers and engineers, Fish Speech v1.4 represents a shift toward "Speech-Language Models" (SLMs) that can be fully self-hosted for privacy-sensitive or latency-critical applications.
The core of Fish Speech v1.4 is a Dual-Autoregressive (Dual-AR) architecture. This design separates generation into two distinct stages (sketched in code after the list):

1. A slow autoregressive transformer that consumes the input text (and any reference conditioning) and produces one high-level hidden state per audio frame, capturing semantics and prosody.
2. A fast autoregressive transformer that, frame by frame, expands each hidden state into the stack of VQ codebook tokens that the codec decoder converts back into audio.
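In rough Python pseudostructure, the decode loop looks something like the following. The `slow_tf` and `fast_tf` interfaces are hypothetical stand-ins for the two transformers, not the fish-speech API:

```python
import torch

def dual_ar_generate(slow_tf, fast_tf, text_ids: torch.Tensor,
                     n_frames: int, n_codebooks: int = 4) -> list[list[int]]:
    """Sketch of the Dual-AR decode loop described above.

    `slow_tf` and `fast_tf` are illustrative placeholders for the slow
    and fast transformers; their methods are not fish-speech's API.
    """
    frames = []
    state = slow_tf.init_state(text_ids)          # condition on the input text
    for _ in range(n_frames):
        hidden, state = slow_tf.step(state)       # slow AR: one step per frame
        codes: list[int] = []
        for _ in range(n_codebooks):              # fast AR: one step per codebook
            codes.append(fast_tf.step(hidden, codes))
        frames.append(codes)                      # one frame of stacked VQ codes
        state = slow_tf.feed_back(state, codes)   # next frame sees this output
    return frames                                 # the codec decoder turns these into audio
```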
Because it operates over VQ audio tokens, the model doesn't just "read" text; it predicts the prosody, emotion, and timbre of the speech based on the provided reference audio. While the exact parameter count remains undisclosed by Fish Audio, the architecture's reliance on dense transformer blocks suggests a performance profile similar to mid-sized LLMs.
The model supports a multilingual repertoire including English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic. The training data is heavily weighted toward English and Chinese (roughly 300k hours each), with the remaining languages supported by approximately 20k hours of data each. This scale helps the model capture subtle nuances in dialect and tone that smaller, specialized TTS models often miss.
Fish Speech v1.4 excels in zero-shot voice cloning. By providing a short (5-10 second) audio sample, users can generate speech that mimics the target voice with high accuracy.
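The end-to-end cloning flow is three steps: encode the reference clip into VQ tokens, condition generation on those tokens plus the reference transcript, and decode the newly generated tokens to audio. The sketch below captures that flow; the three callables are placeholders for the codec encoder, LLM backbone, and codec decoder, not the actual fish-speech API:

```python
from pathlib import Path

def clone_voice(encode_reference, generate_tokens, decode_waveform,
                reference_wav: Path, reference_text: str, target_text: str):
    """Zero-shot cloning flow; the three callables are hypothetical."""
    # 1. Encode the 5-10 second reference clip into VQ prompt tokens.
    prompt_tokens = encode_reference(reference_wav)
    # 2. Generate VQ tokens for the new text, conditioned on the
    #    reference tokens and their transcript.
    new_tokens = generate_tokens(target_text,
                                 prompt_text=reference_text,
                                 prompt_tokens=prompt_tokens)
    # 3. Decode the generated tokens back into a waveform.
    return decode_waveform(new_tokens)
```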
By embedding inline markers such as [angry], [whispering], or [excited] in the input text, practitioners can control the emotional output of the model. This is a critical feature for game developers creating NPC dialogue or creators producing audiobooks. Support for non-verbal cues like [laughing] or [sighing] makes interactions feel significantly more human than standard robotic TTS.
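Because the markers are plain bracketed strings embedded in the input text, building a tagged script is simple string work. A trivial sketch follows; the tags shown are just the examples above, and the full supported set is version-specific:

```python
def tag(text: str, emotion: str | None = None) -> str:
    """Prefix a line with an inline emotion marker, if one is given."""
    return f"[{emotion}] {text}" if emotion else text

script = " ".join([
    tag("You found the hidden passage!", "excited"),
    tag("Keep your voice down.", "whispering"),
    "Well, that's one way to do it. [laughing]",
])
```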
Running Fish Speech v1.4 locally requires a modern GPU, with VRAM capacity and memory bandwidth being the main constraints. Because the model uses an LLM-based backbone, hardware requirements are more demanding than those of older models like Coqui TTS or Piper.

To maximize performance on consumer hardware, quantization is highly recommended. While the model is often distributed in FP16 or BF16, converting to 4-bit or 8-bit formats (such as GGUF or EXL2) can significantly reduce the VRAM footprint without a perceptible loss in audio quality.
For most users, Q4_K_M or Q8_0 quantization provides the best balance. Since audio quality is highly sensitive to token "jitter," 8-bit quantization is generally preferred over 4-bit if your VRAM allows it, as it preserves the subtle emotional inflections in the VQ tokens more effectively.
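As a back-of-envelope check, weight memory scales linearly with bits per parameter. The sketch below uses a placeholder 1B-parameter count (Fish Audio has not published the real figure) and approximate GGUF bit-widths:

```python
# Approximate bytes per parameter: Q8_0 stores ~8.5 bits, Q4_K_M ~4.5 bits.
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.06, "q4_k_m": 0.56}

def vram_gb(n_params: float, fmt: str, overhead_gb: float = 1.0) -> float:
    """Weight memory plus a rough allowance for KV cache and activations."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9 + overhead_gb

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:7s} ~{vram_gb(1.0e9, fmt):.1f} GB")  # placeholder 1B params
```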
Fish Speech v1.4 competes in a narrow field of high-end, open-weights TTS models.
Fish Speech v1.4 is released under the CC-BY-NC-SA-4.0 license. This is a non-commercial license: you can use the model for research, personal projects, and experimentation, but you cannot use the weights for commercial purposes without a separate agreement from Fish Audio. The source code itself is released under the more permissive BSD-3-Clause license. Practitioners should confirm that their use case fits within these legal boundaries before deploying in a production environment.