A neural codec language model for zero-shot speech editing and TTS in the wild, released by Puyuan Peng et al. at UT Austin.
VoiceCraft 2.0 is a 0.33B parameter dense neural codec language model developed by researchers at The University of Texas at Austin, in collaboration with FAIR (Meta) and Rembrand. It is designed for two specific tasks: zero-shot speech editing and zero-shot text-to-speech (TTS) on in-the-wild audio data. The model was released by Puyuan Peng and his team, and it targets practitioners who need to edit existing speech recordings or clone voices from minimal reference audio without any fine-tuning.
At 0.33B parameters, VoiceCraft 2.0 sits in the small-to-medium range for generative audio models. Its small footprint is intentional—it enables the model to run on consumer-grade hardware while still achieving state-of-the-art results on speech editing and zero-shot TTS benchmarks. The model operates on tokenized audio representations using a neural codec, which means it processes speech as discrete tokens rather than raw waveforms. This approach allows VoiceCraft 2.0 to perform token infilling: inserting, replacing, or generating speech tokens within an existing audio sequence.
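To make the idea concrete, here is a toy sketch of token infilling on a single token stream. The values and shapes are illustrative only, not the model's actual interface; real VoiceCraft tokens come from a neural codec with several parallel codebooks.

```python
import torch

# Toy illustration of token infilling on one codec token stream.
original = torch.tensor([12, 7, 93, 41, 41, 8, 55, 3])  # made-up codec tokens

# Suppose positions 3..4 hold the tokens for the word being replaced.
edit_start, edit_end = 3, 5
prefix, suffix = original[:edit_start], original[edit_end:]

# The model conditions on prefix + suffix (plus the edited transcript)
# and generates new tokens for the gap; this tensor stands in for its output.
new_span = torch.tensor([77, 19, 62, 4])

edited = torch.cat([prefix, new_span, suffix])
print(edited.tolist())  # the new span may be longer or shorter than the old one
```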
What sets VoiceCraft 2.0 apart is its capability to handle challenging, realistic audio data. The model was evaluated on audiobooks, internet videos, and podcasts—data that includes diverse accents, varying speaking styles, different recording conditions, and background noise or music. It does not require clean, studio-quality recordings to produce usable results. For practitioners working with real-world audio, this robustness matters more than benchmark scores on sanitized datasets.
VoiceCraft 2.0 uses a Transformer decoder architecture with a token rearrangement procedure that combines causal masking and delayed stacking. This design enables the model to generate tokens within an existing sequence rather than only predicting the next token in a left-to-right fashion. The token infilling approach is what makes speech editing possible—you can replace a segment of speech with new content while preserving the original speaker's voice, prosody, and recording environment.
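The following is a minimal sketch of a delayed-stacking rearrangement in the spirit of the paper: codebook k is shifted right by k steps so that each decoding position can condition on lower codebooks of the same audio frame. The pad value and exact layout are assumptions for illustration.

```python
import torch

def delay_stack(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """codes: (K, T) codec tokens; returns a (K, T + K - 1) delayed layout."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # shift codebook k right by k steps
    return out

codes = torch.arange(12).reshape(4, 3)   # 4 codebooks, 3 frames (toy values)
stacked = delay_stack(codes, pad_id=-1)
print(stacked)
# Undo the delay after generation by slicing row k starting at offset k.
```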
The model is dense, meaning all 0.33B parameters are active during inference. There are no expert routing mechanisms or sparse activation patterns to manage. This simplifies deployment: you get predictable memory usage and consistent inference speed regardless of the input. For a dense model of this size, inference is straightforward to optimize with standard quantization and batching techniques.
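As an illustration of the "standard quantization" point, this generic PyTorch sketch applies stock dynamic int8 quantization to a stand-in dense module. It uses the standard torch.ao.quantization API for CPU inference, not any VoiceCraft-specific tooling.

```python
import torch
import torch.nn as nn

decoder_stub = nn.Sequential(            # stand-in for the 0.33B decoder
    nn.Linear(768, 3072), nn.GELU(),
    nn.Linear(3072, 768),
)

# Replace linear layers with dynamically quantized int8 equivalents.
quantized = torch.ao.quantization.quantize_dynamic(
    decoder_stub, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128, 768)             # (batch, sequence, hidden)
with torch.no_grad():
    y = quantized(x)
print(y.shape)                           # torch.Size([1, 128, 768])
```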
No official context length is specified for VoiceCraft 2.0, but based on the model's design for speech editing and TTS, it processes audio in segments. The reference audio (the voice you want to clone or the recording you want to edit) typically needs to be only a few seconds long. The model does not require long context windows because it operates on compressed token sequences from the neural codec, not on raw audio samples.
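A back-of-the-envelope token budget makes this concrete. The frame rate and codebook count below are assumptions in line with EnCodec-style codecs, not official specs.

```python
# Rough token budget for a codec-tokenized clip (assumed figures).
frame_rate_hz = 50      # codec frames per second of audio
num_codebooks = 4       # parallel token streams per frame
clip_seconds = 10

frames = frame_rate_hz * clip_seconds     # 500 frames
tokens = frames * num_codebooks           # 2000 codec tokens
print(f"{clip_seconds}s of audio -> {tokens} codec tokens")
# Even a 10-second clip stays far below typical Transformer context
# budgets, which is why long raw-audio windows are unnecessary.
```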
From the language model's perspective, all input is discrete tokens: you provide text transcripts for editing or target text for TTS, while the audio itself is encoded into tokens by a separate neural codec model before being passed to VoiceCraft 2.0. This modular design means you can swap the codec or preprocessing pipeline without retraining the language model itself.
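A sketch of what that modularity looks like in code; the interfaces below are hypothetical names, not the repository's API.

```python
from typing import Protocol
import torch

# Hypothetical interfaces: codec and language model are separate
# components, so the codec can be swapped without retraining the LM.
class AudioCodec(Protocol):
    def encode(self, wav: torch.Tensor) -> torch.Tensor: ...   # wav -> (K, T) tokens
    def decode(self, codes: torch.Tensor) -> torch.Tensor: ... # tokens -> wav

class CodecLM(Protocol):
    def infill(self, codes: torch.Tensor, text: str) -> torch.Tensor: ...

def edit_speech(codec: AudioCodec, lm: CodecLM,
                wav: torch.Tensor, edited_text: str) -> torch.Tensor:
    codes = codec.encode(wav)                   # audio -> discrete tokens
    new_codes = lm.infill(codes, edited_text)   # LM operates purely on tokens
    return codec.decode(new_codes)              # tokens -> audio
```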
VoiceCraft 2.0 has two primary capabilities: zero-shot speech editing and zero-shot text-to-speech. Both are "zero-shot" in the sense that the model can work with voices it has never encountered during training, using only a few seconds of reference audio.
Speech editing is the model's standout feature. You provide an original audio recording, its transcript, and an edited transcript with the changes you want. VoiceCraft 2.0 generates the edited audio with the modified portion seamlessly integrated into the original recording. The output is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by human listeners. Use cases include correcting misspoken words in podcast recordings, updating dialogue in video content without re-recording, and fixing errors in audiobook narration.
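One concrete piece of that workflow is locating which words changed between the two transcripts. The difflib-based sketch below is a simple stand-in for the repository's actual alignment step, shown only to illustrate the idea.

```python
import difflib

original = "the quick brown fox jumps over the lazy dog"
edited   = "the quick red fox leaps over the lazy dog"

# Word-level diff: find the spans where the transcripts disagree.
matcher = difflib.SequenceMatcher(None, original.split(), edited.split())
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, original.split()[i1:i2], "->", edited.split()[j1:j2])
# replace ['brown'] -> ['red']
# replace ['jumps'] -> ['leaps']
# The changed words map (via forced alignment) to the audio span whose
# codec tokens get regenerated; everything else is kept verbatim.
```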
Zero-shot TTS allows you to clone an unseen voice from a few seconds of reference audio and generate new speech from arbitrary text. The model outperforms prior state-of-the-art models, including VALL-E and the commercial XTTS-v2, on realistic, challenging datasets. Unlike many TTS systems that degrade significantly with background noise or unusual speaking styles, VoiceCraft 2.0 maintains voice consistency and naturalness across diverse conditions.
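Conceptually, zero-shot TTS here works as prompt continuation over codec tokens. The sketch below uses hypothetical names (codec.encode, lm.generate) to show the shape of the pipeline; the repository's inference_tts.ipynb contains the real one.

```python
import torch

# Conceptual sketch of zero-shot TTS as prompt continuation. All names
# are illustrative, not the repository's API.
def zero_shot_tts(lm, codec, reference_wav: torch.Tensor,
                  reference_text: str, target_text: str) -> torch.Tensor:
    prompt_codes = codec.encode(reference_wav)   # a few seconds of audio suffice
    full_text = reference_text + " " + target_text
    # The model continues the prompt tokens, conditioned on both texts,
    # so the generated speech keeps the reference speaker's voice.
    new_codes = lm.generate(text=full_text, prompt=prompt_codes)
    return codec.decode(new_codes)               # back to a waveform
```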
The model is English-only based on available documentation. It is licensed under CC-BY-NC-SA-4.0, which permits non-commercial use, sharing, and adaptation with attribution. Commercial deployment requires separate licensing from the University of Texas at Austin.
VoiceCraft 2.0's 0.33B parameter count makes it feasible to run on consumer hardware, but the model's architecture and inference pipeline have specific requirements beyond just parameter count.
At full FP16 precision, VoiceCraft 2.0 requires approximately 0.7-1.0 GB of VRAM for the model weights alone. However, the total memory footprint during inference depends on your audio processing pipeline, batch size, and whether you include the neural codec model in the same process. Realistic VRAM usage for the full inference pipeline (codec encoding + language model inference + audio decoding) is 2-4 GB.
With 4-bit quantization (Q4_K_M), model weights drop to roughly 0.2-0.3 GB, and total pipeline memory usage falls to 1-2 GB. This makes the model runnable on GPUs with as little as 4 GB VRAM, including older or entry-level cards.
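The weight-only arithmetic behind those figures is straightforward:

```python
# Weight-memory arithmetic. Activations, KV cache, and the codec model
# add to these totals, which is why the pipeline figures run higher.
params = 0.33e9

fp16_gb = params * 2 / 1e9      # 2 bytes per parameter -> ~0.66 GB
q4_gb = params * 0.5 / 1e9      # 4 bits per parameter  -> ~0.17 GB

print(f"FP16 weights : ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{q4_gb:.2f} GB")
# Q4_K_M stores slightly more than 4 bits/weight once per-group scales
# are counted, which is why the 0.2-0.3 GB figure above is a bit higher.
```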
At 0.33B parameters, VoiceCraft 2.0 achieves fast inference on modern hardware. On an RTX 4090, expect 50-100 tokens per second for the language model component, translating to sub-second generation for typical speech editing segments. On an M4 Max, performance is comparable. On lower-end hardware with quantization, expect 20-40 tokens per second—still fast enough for interactive use.
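To sanity-check the sub-second claim, here is a rough latency estimate assuming an EnCodec-style ~50 frames/s codec and one codec frame generated per decoding step (both are assumptions, not published specs).

```python
# Rough latency estimate from the throughput figures above.
frame_rate_hz = 50           # assumed codec frames per second of audio
steps_per_second = 75        # mid-range of the 50-100 tokens/s figure

audio_seconds = 1.0          # a short edited segment
gen_seconds = audio_seconds * frame_rate_hz / steps_per_second
print(f"~{gen_seconds:.2f}s to generate {audio_seconds}s of audio")  # ~0.67s
```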
The official GitHub repository provides multiple paths for local deployment: Docker images, Gradio web UI, and standalone inference scripts. For quick testing, the Gradio Colab notebook can be adapted for local use. The repository includes inference_tts.ipynb and inference_speech_editing.ipynb Jupyter notebooks that walk through the full pipeline step by step.
For most users running on consumer hardware, Q4_K_M offers the best balance of quality and efficiency. The model is small enough that 4-bit quantization introduces minimal quality degradation for TTS and speech editing tasks. If you have 8+ GB VRAM and want maximum quality, run at FP16. There is no practical benefit to 8-bit quantization for this model size—you either save significant VRAM with 4-bit or run full precision.
VoiceCraft 2.0 occupies a specific niche: zero-shot speech editing and TTS for in-the-wild audio. The most direct comparison is with XTTS-v2, a popular commercial model. VoiceCraft 2.0 outperforms XTTS-v2 on realistic datasets with background noise, diverse accents, and varying recording quality. XTTS-v2 has the advantage of broader language support and a more polished deployment pipeline, but VoiceCraft 2.0 produces more natural results on challenging audio.
Compared to VALL-E, another zero-shot TTS model, VoiceCraft 2.0 achieves better naturalness scores in human evaluation. VALL-E was a research milestone, but VoiceCraft 2.0 improves upon it significantly, particularly for speech editing, which VALL-E does not support natively.
The tradeoff is that VoiceCraft 2.0 is research software. The deployment pipeline requires more setup than commercial alternatives, and the documentation assumes familiarity with neural audio codecs and Transformer inference. For practitioners who need a plug-and-play solution and can tolerate some quality loss on difficult audio, XTTS-v2 may be more practical. For those who need the best possible results on real-world audio and are comfortable with a research-grade codebase, VoiceCraft 2.0 is the better choice.