A 1.2B-parameter open-source English TTS foundation model combining a causal GPT over EnCodec tokens with multi-band diffusion.
MetaVoice-1B is a 1.2B-parameter open-source text-to-speech foundation model developed by MetaVoice, a team with prior experience building and commercializing frontier AI products. Trained on 100,000 hours of English speech, it combines a causal GPT architecture over EnCodec tokens with multi-band diffusion to produce emotionally expressive synthesized voice. The model is released under the Apache 2.0 license, making it freely usable in both commercial and research applications.
The primary differentiator for MetaVoice-1B is its focus on emotional speech rhythm and tone in English, along with zero-shot voice cloning for American and British accents using just 30 seconds of reference audio. It targets developers and product teams building voice-first applications — such as interactive agents, audiobook pipelines, accessibility tools, and conversational AI — who want to run synthesis locally rather than rely on cloud TTS APIs. The model competes in the mid-range TTS space, alongside systems like Tortoise TTS and StyleTTS 2, though it occupies a distinct position as an Apache 2.0-licensed model with explicit support for local deployment and fine-tuning.
MetaVoice-1B uses a dense architecture with 1.2B parameters and operates in two main stages. First, a causal GPT predicts the first two hierarchies of EnCodec tokens from text and speaker conditioning. Both text and audio tokens share the LLM context window, and speaker identity is injected via conditioning at the token embedding layer, derived from a separately trained speaker verification network. Text is tokenized with a custom-trained BPE tokenizer that has a 512-token vocabulary. The two hierarchies are predicted in a flattened, interleaved manner: first token of hierarchy one, first token of hierarchy two, second token of hierarchy one, and so on. The model employs condition-free sampling to improve its zero-shot cloning capability.
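To make the interleaving concrete, the sketch below shows the ordering for two token hierarchies. The function name and token values are purely illustrative and are not taken from the metavoice-src codebase.

```python
# Illustrative sketch of the "flattened interleaved" ordering used for the
# first two EnCodec hierarchies. Names and values are examples only, not
# code from metavoice-src.

def flatten_interleave(hier1: list[int], hier2: list[int]) -> list[int]:
    """Merge two equal-length hierarchies into the order the causal GPT
    predicts: h1[0], h2[0], h1[1], h2[1], ..."""
    assert len(hier1) == len(hier2)
    flat: list[int] = []
    for t1, t2 in zip(hier1, hier2):
        flat.extend([t1, t2])
    return flat

# Three timesteps of two-codebook tokens.
print(flatten_interleave([11, 12, 13], [21, 22, 23]))
# -> [11, 21, 12, 22, 13, 23]
```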
A second, smaller non-causal transformer (~10M parameters) predicts the remaining six hierarchies from the first two, enabling parallel timestep prediction and strong zero-shot generalization across a wide range of speakers. Finally, multi-band diffusion generates waveforms from the EnCodec tokens, followed by DeepFilterNet post-processing to remove artifacts introduced by the diffusion stage.
The architecture includes support for KV-caching via Flash Decoding and batching of variable-length texts, both of which improve inference throughput in production scenarios. Context handling supports long-form synthesis, allowing arbitrary-length text to be processed in a single generation pass.
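As a rough illustration of the variable-length batching idea, the snippet below right-pads token sequences and builds an attention mask. It describes the general technique rather than the repository's actual implementation.

```python
# Minimal sketch of batching variable-length token sequences with right-padding
# and a boolean attention mask (True = real token, False = padding).
# Illustrative only; not code from metavoice-src.
import torch

def pad_batch(seqs: list[list[int]], pad_id: int = 0):
    """Right-pad sequences to a common length and build the matching mask."""
    max_len = max(len(s) for s in seqs)
    batch = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros((len(seqs), max_len), dtype=torch.bool)
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = torch.tensor(s, dtype=torch.long)
        mask[i, : len(s)] = True
    return batch, mask

tokens, attn_mask = pad_batch([[5, 9, 2], [7, 1], [4, 4, 4, 4]])
print(tokens.shape, attn_mask.sum(dim=1))  # torch.Size([3, 4]) tensor([3, 2, 4])
```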
MetaVoice-1B excels at expressive English TTS, with an emphasis on emotional rhythm and natural prosody. Its other key capabilities are zero-shot cloning of American and British voices from roughly 30 seconds of reference audio, fine-tuning to new voices from as little as one minute of data, and long-form synthesis of extended text.
Practical use cases include building AI customer support agents with consistent brand voices, generating narration for video content, creating accessible reading assistants, prototyping voice interfaces for robotics, and powering interactive NPCs in games. The Apache 2.0 license removes the friction of commercial licensing negotiations, making it viable for product teams shipping to market.
MetaVoice-1B is designed to run on consumer-grade GPU hardware, though its 1.2B parameter dense architecture demands meaningful VRAM. The official setup requires a GPU with at least 12GB of VRAM and Python 3.10–3.11. An RTX 4090 (24GB) provides a comfortable margin for running the full model in FP16, while an RTX 4080 (16GB) can handle the model with minor headroom. On older hardware, an RTX 3090 (24GB) remains a viable option.
For users with constrained VRAM, quantization is the standard approach. Q4_K_M is the recommended quantization level for most users — it strikes a balance between model quality and memory reduction, typically fitting the 1.2B model into roughly 700–800MB of VRAM after quantization. Q8_0 provides higher quality at the cost of roughly double the VRAM usage. Lower quantizations like Q3_K_M or Q2_K reduce memory further but introduce audible artifacts, particularly in prosody and timbre reproduction, which are central to this model's value proposition.
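For a rough sense of how these levels translate into memory, the back-of-envelope calculation below uses approximate bits-per-weight figures for each format. K-quant formats carry some metadata overhead, and activations plus the diffusion and post-processing stages are not included, so treat the results as ballpark figures.

```python
# Back-of-envelope VRAM estimate for 1.2B parameters at different precisions.
# Bits-per-weight values are approximations, not measurements.
PARAMS = 1.2e9
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8.5, "Q4_K_M": 4.85}

for name, bits in BITS_PER_WEIGHT.items():
    gib = PARAMS * bits / 8 / 1024**3
    print(f"{name:7s} ~{gib:.2f} GiB for weights alone")
# FP16 ~2.24 GiB, Q8_0 ~1.19 GiB, Q4_K_M ~0.68 GiB; activations and the
# diffusion/vocoder stages add memory on top of this.
```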
The quickest path to running MetaVoice-1B locally is via Ollama, which provides a simplified interface for loading and querying the model without manual environment setup. Developers who need more control can clone the official GitHub repository (metavoiceio/metavoice-src) and use the provided Docker Compose configuration for either a web UI or a REST API server. The server exposes endpoints documented at /docs for programmatic integration.
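As a sketch of what programmatic integration might look like, the snippet below posts text to the server. The port, endpoint path, and payload field names are assumptions; check the server's /docs page for the actual schema before relying on them.

```python
# Hypothetical client call against the REST server. The host/port, endpoint
# name, and payload fields below are assumptions; consult /docs for the
# real schema.
import requests

resp = requests.post(
    "http://localhost:8000/tts",                 # assumed host/port and endpoint
    json={
        "text": "Hello from MetaVoice-1B.",
        "speaker_ref_path": "my_voice_30s.wav",  # assumed field name
    },
    timeout=120,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)
```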
Performance varies based on hardware and batch size. On an RTX 4090, you can expect real-time or faster synthesis for typical sentence-length inputs. The Flash Decoding optimization improves throughput for longer sequences by accelerating autoregressive generation. When batching multiple texts of different lengths, the model's support for variable-length batching helps maintain reasonable throughput.
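A simple way to sanity-check the real-time claim on your own hardware is to compute the real-time factor: synthesis time divided by audio duration. The helper below is generic; the synthesize_to_wav call in the commented usage is a placeholder for whichever inference entry point you use.

```python
# Real-time factor (RTF) = synthesis time / audio duration; values below 1.0
# mean audio is generated faster than it plays back.
import wave

def real_time_factor(wav_path: str, synth_seconds: float) -> float:
    """Compute RTF for a generated WAV file given the measured synthesis time."""
    with wave.open(wav_path, "rb") as w:
        audio_seconds = w.getnframes() / w.getframerate()
    return synth_seconds / audio_seconds

# Usage, with synthesize_to_wav standing in for your inference entry point:
#   import time
#   start = time.perf_counter()
#   synthesize_to_wav("This is a benchmark sentence.", "out.wav")
#   print(f"RTF: {real_time_factor('out.wav', time.perf_counter() - start):.2f}")
```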
One practical consideration: MetaVoice-1B requires ffmpeg for audio preprocessing and postprocessing. Ensure it is installed and available in your PATH before running the model.
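A quick preflight check like the one below avoids confusing failures later; it simply verifies that ffmpeg resolves on PATH.

```python
# Verify ffmpeg is installed and reachable before starting the model.
import shutil

ffmpeg_path = shutil.which("ffmpeg")
if ffmpeg_path is None:
    raise SystemExit("ffmpeg not found on PATH; install it first (e.g. apt install ffmpeg).")
print("ffmpeg found at:", ffmpeg_path)
```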
MetaVoice-1B vs. Tortoise TTS: Tortoise TTS is a long-established open-source TTS system that prioritizes output quality over speed, often requiring multiple passes and significant compute. MetaVoice-1B is faster due to its end-to-end architecture and supports streaming inference optimizations that Tortoise lacks. MetaVoice-1B's Apache 2.0 license is also more permissive than Tortoise's custom research license. However, Tortoise has been more thoroughly evaluated across a wider range of voices and use cases, and some practitioners report that Tortoise's output on certain voices is cleaner due to training on higher-quality audio data. Choose MetaVoice-1B for speed, licensing simplicity, and fine-tuning flexibility; choose Tortoise if maximum output fidelity for specific voices is the primary concern.
MetaVoice-1B vs. StyleTTS 2: StyleTTS 2 is a diffusion-based TTS model known for high naturalness and expressive style control. It operates in a different architectural paradigm and is not Apache 2.0 licensed by default. MetaVoice-1B's zero-shot cloning for American and British voices with 30 seconds of reference audio is competitive with StyleTTS 2's cloning capabilities, and MetaVoice-1B's fine-tuning requirements are lower — one minute of data versus the more substantial datasets typically needed for StyleTTS 2 adaptation. If you need fine-grained style control beyond voice cloning, StyleTTS 2 may have an edge. For developers prioritizing a ready-to-run open-weight model with strong default quality and a permissive license, MetaVoice-1B is the stronger choice.
Bottom line: MetaVoice-1B occupies a practical middle ground — more capable and easier to deploy than older open-source TTS systems, more permissive in licensing than some newer alternatives, and specifically optimized for local inference on consumer hardware. Its 1.2B parameter footprint is well-suited to the current generation of consumer GPUs, and its Apache 2.0 license removes the commercial ambiguity that complicates many other TTS options.