Multilingual open-weights TTS model by Fish Audio using an LLM-based dual-AR architecture over VQ audio tokens, trained on 700k hours across 8 languages.
Fish Speech v1.4 is a state-of-the-art, open-weights text-to-speech (TTS) system developed by Fish Audio. Unlike traditional TTS engines that rely on concatenative synthesis or simple neural vocoders, Fish Speech v1.4 adopts a Large Language Model (LLM) approach. It treats audio as a sequence of discrete VQ (Vector Quantized) tokens, allowing it to leverage the same transformer-based architectures that power modern LLMs.
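To make the "audio as tokens" idea concrete, here is a minimal, self-contained sketch — not Fish Audio's code; the codebook size and model dimensions are placeholders — showing that once speech is VQ-encoded, generation reduces to ordinary next-token prediction:

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE = 1024  # assumed codebook size; the real value is model-specific

class TinyAudioLM(nn.Module):
    """Minimal causal transformer over discrete audio codes."""
    def __init__(self, vocab: int = CODEBOOK_SIZE, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:  # (batch, seq)
        mask = nn.Transformer.generate_square_subsequent_mask(codes.size(1))
        h = self.blocks(self.embed(codes), mask=mask)
        return self.head(h)  # logits over the codebook, per position

# Once audio is integers, generation is ordinary next-token sampling.
codes = torch.randint(0, CODEBOOK_SIZE, (1, 32))
next_code = TinyAudioLM()(codes)[:, -1].argmax(dim=-1)
```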
This model is specifically designed for high-fidelity voice cloning and expressive speech generation across eight major languages. By training on a massive 700,000-hour dataset—a significant jump from the 200,000 hours used in previous iterations—Fish Audio has positioned v1.4 as a primary local alternative to proprietary APIs like ElevenLabs. For developers and engineers, Fish Speech v1.4 represents a shift toward "Speech-Language Models" (SLMs) that can be fully self-hosted for privacy-sensitive or latency-critical applications.
The core of Fish Speech v1.4 is a Dual-Autoregressive (Dual-AR) architecture. This design separates generation into two distinct stages (sketched in code after the list):

1. A slow autoregressive transformer that consumes the input text (and any reference conditioning) and produces one high-level hidden state per audio frame, capturing semantics and prosody.
2. A fast autoregressive transformer that, frame by frame, expands each hidden state into the stack of VQ codebook tokens that the codec decoder converts back into audio.
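In rough Python pseudostructure, the decode loop looks something like the following. The `slow_tf` and `fast_tf` interfaces are hypothetical stand-ins for the two transformers, not the fish-speech API:

```python
import torch

def dual_ar_generate(slow_tf, fast_tf, text_ids: torch.Tensor,
                     n_frames: int, n_codebooks: int = 4) -> list[list[int]]:
    """Sketch of the Dual-AR decode loop described above.

    `slow_tf` and `fast_tf` are illustrative placeholders for the slow
    and fast transformers; their methods are not fish-speech's API.
    """
    frames = []
    state = slow_tf.init_state(text_ids)          # condition on the input text
    for _ in range(n_frames):
        hidden, state = slow_tf.step(state)       # slow AR: one step per frame
        codes: list[int] = []
        for _ in range(n_codebooks):              # fast AR: one step per codebook
            codes.append(fast_tf.step(hidden, codes))
        frames.append(codes)                      # one frame of stacked VQ codes
        state = slow_tf.feed_back(state, codes)   # next frame sees this output
    return frames                                 # the codec decoder turns these into audio
```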
Because it operates over VQ audio tokens, the model doesn't just "read" text; it predicts the prosody, emotion, and timbre of the speech based on the provided reference audio. While the exact parameter count remains undisclosed by Fish Audio, the architecture's reliance on dense transformer blocks suggests a performance profile similar to mid-sized LLMs.
The model supports a multilingual repertoire including English, Chinese, German, Japanese, French, Spanish, Korean, and Arabic. The training data is heavily weighted toward English and Chinese (roughly 300k hours each), with the remaining languages supported by approximately 20k hours of data each. This scale helps the model capture subtle nuances in dialect and tone that smaller, specialized TTS models often miss.
Fish Speech v1.4 excels in zero-shot voice cloning. By providing a short (5-10 second) audio sample, users can generate speech that mimics the target voice with high accuracy.
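The end-to-end cloning flow is three steps: encode the reference clip into VQ tokens, condition generation on those tokens plus the reference transcript, and decode the newly generated tokens to audio. The sketch below captures that flow; the three callables are placeholders for the codec encoder, LLM backbone, and codec decoder, not the actual fish-speech API:

```python
from pathlib import Path

def clone_voice(encode_reference, generate_tokens, decode_waveform,
                reference_wav: Path, reference_text: str, target_text: str):
    """Zero-shot cloning flow; the three callables are hypothetical."""
    # 1. Encode the 5-10 second reference clip into VQ prompt tokens.
    prompt_tokens = encode_reference(reference_wav)
    # 2. Generate VQ tokens for the new text, conditioned on the
    #    reference tokens and their transcript.
    new_tokens = generate_tokens(target_text,
                                 prompt_text=reference_text,
                                 prompt_tokens=prompt_tokens)
    # 3. Decode the generated tokens back into a waveform.
    return decode_waveform(new_tokens)
```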
By embedding inline markers such as [angry], [whispering], or [excited] in the input text, practitioners can control the emotional output of the model. This is a critical feature for game developers creating NPC dialogue or creators producing audiobooks. Support for non-verbal cues like [laughing] or [sighing] makes interactions feel significantly more human than standard robotic TTS.
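Because the markers are plain bracketed strings embedded in the input text, building a tagged script is simple string work. A trivial sketch follows; the tags shown are just the examples above, and the full supported set is version-specific:

```python
def tag(text: str, emotion: str | None = None) -> str:
    """Prefix a line with an inline emotion marker, if one is given."""
    return f"[{emotion}] {text}" if emotion else text

script = " ".join([
    tag("You found the hidden passage!", "excited"),
    tag("Keep your voice down.", "whispering"),
    "Well, that's one way to do it. [laughing]",
])
```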
Running Fish Speech v1.4 locally requires a modern GPU, with VRAM capacity and memory bandwidth being the main constraints. Because the model uses an LLM-based backbone, hardware requirements are more demanding than those of older models like Coqui TTS or Piper.

To maximize performance on consumer hardware, quantization is highly recommended. While the model is often distributed in FP16 or BF16, converting to 4-bit or 8-bit formats (such as GGUF or EXL2) can significantly reduce the VRAM footprint without a perceptible loss in audio quality.
For most users, Q4_K_M or Q8_0 quantization provides the best balance. Since audio quality is highly sensitive to token "jitter," 8-bit quantization is generally preferred over 4-bit if your VRAM allows it, as it preserves the subtle emotional inflections in the VQ tokens more effectively.
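As a back-of-envelope check, weight memory scales linearly with bits per parameter. The sketch below uses a placeholder 1B-parameter count (Fish Audio has not published the real figure) and approximate GGUF bit-widths:

```python
# Approximate bytes per parameter: Q8_0 stores ~8.5 bits, Q4_K_M ~4.5 bits.
BYTES_PER_PARAM = {"fp16": 2.0, "q8_0": 1.06, "q4_k_m": 0.56}

def vram_gb(n_params: float, fmt: str, overhead_gb: float = 1.0) -> float:
    """Weight memory plus a rough allowance for KV cache and activations."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9 + overhead_gb

for fmt in BYTES_PER_PARAM:
    print(f"{fmt:7s} ~{vram_gb(1.0e9, fmt):.1f} GB")  # placeholder 1B params
```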
Fish Speech v1.4 competes in a narrow field of high-end, open-weights TTS models.
Fish Speech v1.4 is released under the CC-BY-NC-SA-4.0 license. This is a non-commercial license: you can use the model for research, personal projects, and experimentation, but you cannot use the weights for commercial purposes without a separate agreement from Fish Audio. The source code itself is released under the more permissive BSD-3-Clause license. Practitioners should confirm that their use case fits within these legal boundaries before deploying in a production environment.