A 1.2B-parameter open-source English TTS foundation model combining a causal GPT over EnCodec tokens with multi-band diffusion.
MetaVoice-1B is a 1.2B-parameter open-source text-to-speech foundation model developed by MetaVoice, a team with prior experience building and commercializing frontier AI products. Trained on 100,000 hours of English speech, it combines a causal GPT architecture over EnCodec tokens with multi-band diffusion to produce emotionally expressive synthesized voice. The model is released under the Apache 2.0 license, making it freely usable in both commercial and research applications.
The primary differentiator for MetaVoice-1B is its focus on emotional speech rhythm and tone in English, along with zero-shot voice cloning for American and British accents using just 30 seconds of reference audio. It targets developers and product teams building voice-first applications — such as interactive agents, audiobook pipelines, accessibility tools, and conversational AI — who want to run synthesis locally rather than rely on cloud TTS APIs. The model competes in the mid-range TTS space, alongside systems like Tortoise TTS and StyleTTS 2, though it occupies a distinct position as an Apache 2.0-licensed model with explicit support for local deployment and fine-tuning.
MetaVoice-1B uses a dense architecture with 1.2B parameters and operates in two main stages. First, a causal GPT predicts the first two hierarchies of EnCodec tokens from text and speaker conditioning. Both text and audio tokens share the LLM context window, and speaker identity is injected via conditioning at the token embedding layer, derived from a separately trained speaker verification network. Text is tokenized with a custom-trained BPE tokenizer that has a 512-token vocabulary. The two hierarchies are predicted in a flattened, interleaved manner: first token of hierarchy one, first token of hierarchy two, second token of hierarchy one, and so on. The model employs condition-free sampling to improve its zero-shot cloning capability.
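To make the interleaving concrete, the sketch below shows the ordering for two token hierarchies. The function name and token values are purely illustrative and are not taken from the metavoice-src codebase.

```python
# Illustrative sketch of the "flattened interleaved" ordering used for the
# first two EnCodec hierarchies. Names and values are examples only, not
# code from metavoice-src.

def flatten_interleave(hier1: list[int], hier2: list[int]) -> list[int]:
    """Merge two equal-length hierarchies into the order the causal GPT
    predicts: h1[0], h2[0], h1[1], h2[1], ..."""
    assert len(hier1) == len(hier2)
    flat: list[int] = []
    for t1, t2 in zip(hier1, hier2):
        flat.extend([t1, t2])
    return flat

# Three timesteps of two-codebook tokens.
print(flatten_interleave([11, 12, 13], [21, 22, 23]))
# -> [11, 21, 12, 22, 13, 23]
```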
A second, smaller non-causal transformer (~10M parameters) predicts the remaining six hierarchies from the first two, enabling parallel timestep prediction and strong zero-shot generalization across a wide range of speakers. Finally, multi-band diffusion generates waveforms from the EnCodec tokens, followed by DeepFilterNet post-processing to remove artifacts introduced by the diffusion stage.
The architecture includes support for KV-caching via Flash Decoding and batching of variable-length texts, both of which improve inference throughput in production scenarios. Context handling supports long-form synthesis, allowing arbitrary-length text to be processed in a single generation pass.
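As a rough illustration of the variable-length batching idea, the snippet below right-pads token sequences and builds an attention mask. It describes the general technique rather than the repository's actual implementation.

```python
# Minimal sketch of batching variable-length token sequences with right-padding
# and a boolean attention mask (True = real token, False = padding).
# Illustrative only; not code from metavoice-src.
import torch

def pad_batch(seqs: list[list[int]], pad_id: int = 0):
    """Right-pad sequences to a common length and build the matching mask."""
    max_len = max(len(s) for s in seqs)
    batch = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros((len(seqs), max_len), dtype=torch.bool)
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = torch.tensor(s, dtype=torch.long)
        mask[i, : len(s)] = True
    return batch, mask

tokens, attn_mask = pad_batch([[5, 9, 2], [7, 1], [4, 4, 4, 4]])
print(tokens.shape, attn_mask.sum(dim=1))  # torch.Size([3, 4]) tensor([3, 2, 4])
```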
MetaVoice-1B excels at expressive English TTS, with an emphasis on emotional rhythm and natural prosody. Its other key capabilities are zero-shot cloning of American and British voices from roughly 30 seconds of reference audio, fine-tuning to new voices from as little as one minute of data, and long-form synthesis of extended text.
Practical use cases include building AI customer support agents with consistent brand voices, generating narration for video content, creating accessible reading assistants, prototyping voice interfaces for robotics, and powering interactive NPCs in games. The Apache 2.0 license removes the friction of commercial licensing negotiations, making it viable for product teams shipping to market.
MetaVoice-1B is designed to run on consumer-grade GPU hardware, though its 1.2B parameter dense architecture demands meaningful VRAM. The official setup requires a GPU with at least 12GB of VRAM and Python 3.10–3.11. An RTX 4090 (24GB) provides a comfortable margin for running the full model in FP16, while an RTX 4080 (16GB) can handle the model with minor headroom. On older hardware, an RTX 3090 (24GB) remains a viable option.
For users with constrained VRAM, quantization is the standard approach. Q4_K_M is the recommended quantization level for most users — it strikes a balance between model quality and memory reduction, typically fitting the 1.2B model into roughly 700–800MB of VRAM after quantization. Q8_0 provides higher quality at the cost of roughly double the VRAM usage. Lower quantizations like Q3_K_M or Q2_K reduce memory further but introduce audible artifacts, particularly in prosody and timbre reproduction, which are central to this model's value proposition.
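For a rough sense of how these levels translate into memory, the back-of-envelope calculation below uses approximate bits-per-weight figures for each format. K-quant formats carry some metadata overhead, and activations plus the diffusion and post-processing stages are not included, so treat the results as ballpark figures.

```python
# Back-of-envelope VRAM estimate for 1.2B parameters at different precisions.
# Bits-per-weight values are approximations, not measurements.
PARAMS = 1.2e9
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8.5, "Q4_K_M": 4.85}

for name, bits in BITS_PER_WEIGHT.items():
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:7s} ~{gib:.2f} GiB for weights alone")
# FP16 ~2.24 GiB, Q8_0 ~1.19 GiB, Q4_K_M ~0.68 GiB; activations and the
# diffusion/vocoder stages add memory on top of this.
```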
The quickest path to running MetaVoice-1B locally is via Ollama, which provides a simplified interface for loading and querying the model without manual environment setup. Developers who need more control can clone the official GitHub repository (metavoiceio/metavoice-src) and use the provided Docker Compose configuration for either a web UI or a REST API server. The server exposes endpoints documented at /docs for programmatic integration.
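As a sketch of what programmatic integration might look like, the snippet below posts text to the server. The port, endpoint path, and payload field names are assumptions; check the server's /docs page for the actual schema before relying on them.

```python
# Hypothetical client call against the REST server. The host/port, endpoint
# name, and payload fields below are assumptions; consult /docs for the
# real schema.
import requests

resp = requests.post(
    "http://localhost:8000/tts",                 # assumed host/port and endpoint
    json={
        "text": "Hello from MetaVoice-1B.",
        "speaker_ref_path": "my_voice_30s.wav",  # assumed field name
    },
    timeout=120,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```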
Performance varies based on hardware and batch size. On an RTX 4090, you can expect real-time or faster synthesis for typical sentence-length inputs. The Flash Decoding optimization improves throughput for longer sequences by accelerating autoregressive generation. When batching multiple texts of different lengths, the model's support for variable-length batching helps maintain reasonable throughput.
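A simple way to sanity-check the real-time claim on your own hardware is to compute the real-time factor: synthesis time divided by audio duration. The helper below is generic; the synthesize_to_wav call in the commented usage is a placeholder for whichever inference entry point you use.

```python
# Real-time factor (RTF) = synthesis time / audio duration; values below 1.0
# mean audio is generated faster than it plays back.
import wave

def real_time_factor(wav_path: str, synth_seconds: float) -> float:
    """Compute RTF for a generated WAV file given the measured synthesis time."""
    with wave.open(wav_path, "rb") as w:
        audio_seconds = w.getnframes() / w.getframerate()
    return synth_seconds / audio_seconds

# Usage, with synthesize_to_wav standing in for your inference entry point:
#   import time
#   start = time.perf_counter()
#   synthesize_to_wav("This is a benchmark sentence.", "out.wav")
#   print(f"RTF: {real_time_factor('out.wav', time.perf_counter() - start):.2f}")
```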
One practical consideration: MetaVoice-1B requires ffmpeg for audio preprocessing and postprocessing. Ensure it is installed and available in your PATH before running the model.
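A quick preflight check like the one below avoids confusing failures later; it simply verifies that ffmpeg resolves on PATH.

```python
# Verify ffmpeg is installed and reachable before starting the model.
import shutil

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("ffmpeg not found on PATH; install it first (e.g. apt install ffmpeg).")
print("ffmpeg found at:", ffmpeg_path)
```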
MetaVoice-1B vs. Tortoise TTS: Tortoise TTS is a long-established open-source TTS system that prioritizes output quality over speed, often requiring multiple passes and significant compute. MetaVoice-1B is faster due to its end-to-end architecture and supports streaming inference optimizations that Tortoise lacks. MetaVoice-1B's Apache 2.0 license is also more permissive than Tortoise's custom research license. However, Tortoise has been more thoroughly evaluated across a wider range of voices and use cases, and some practitioners report that Tortoise's output on certain voices is cleaner due to training on higher-quality audio data. Choose MetaVoice-1B for speed, licensing simplicity, and fine-tuning flexibility; choose Tortoise if maximum output fidelity for specific voices is the primary concern.
MetaVoice-1B vs. StyleTTS 2: StyleTTS 2 is a diffusion-based TTS model known for high naturalness and expressive style control. It operates in a different architectural paradigm and is not Apache 2.0 licensed by default. MetaVoice-1B's zero-shot cloning for American and British voices with 30 seconds of reference audio is competitive with StyleTTS 2's cloning capabilities, and MetaVoice-1B's fine-tuning requirements are lower — one minute of data versus the more substantial datasets typically needed for StyleTTS 2 adaptation. If you need fine-grained style control beyond voice cloning, StyleTTS 2 may have an edge. For developers prioritizing a ready-to-run open-weight model with strong default quality and a permissive license, MetaVoice-1B is the stronger choice.
Bottom line: MetaVoice-1B occupies a practical middle ground — more capable and easier to deploy than older open-source TTS systems, more permissive in licensing than some newer alternatives, and specifically optimized for local inference on consumer hardware. Its 1.2B parameter footprint is well-suited to the current generation of consumer GPUs, and its Apache 2.0 license removes the commercial ambiguity that complicates many other TTS options.