A neural codec language model for zero-shot speech editing and TTS in the wild, released by Puyuan Peng et al. at UT Austin.
VoiceCraft 2.0 is a 0.33B parameter dense neural codec language model developed by researchers at The University of Texas at Austin, in collaboration with FAIR (Meta) and Rembrand. It is designed for two specific tasks: zero-shot speech editing and zero-shot text-to-speech (TTS) on in-the-wild audio data. The model was released by Puyuan Peng and his team, and it targets practitioners who need to edit existing speech recordings or clone voices from minimal reference audio without any fine-tuning.
At 0.33B parameters, VoiceCraft 2.0 sits in the small-to-medium range for generative audio models. Its small footprint is intentional—it enables the model to run on consumer-grade hardware while still achieving state-of-the-art results on speech editing and zero-shot TTS benchmarks. The model operates on tokenized audio representations using a neural codec, which means it processes speech as discrete tokens rather than raw waveforms. This approach allows VoiceCraft 2.0 to perform token infilling: inserting, replacing, or generating speech tokens within an existing audio sequence.
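To make the idea concrete, here is a toy sketch of token infilling on a single token stream. The values and shapes are illustrative only, not the model's actual interface; real VoiceCraft tokens come from a neural codec with several parallel codebooks.

```python
import torch

# Toy illustration of token infilling on one codec token stream.
original = torch.tensor([12, 7, 93, 41, 41, 8, 55, 3])  # made-up codec tokens

# Suppose positions 3..4 hold the tokens for the word being replaced.
edit_start, edit_end = 3, 5
prefix, suffix = original[:edit_start], original[edit_end:]

# The model conditions on prefix + suffix (plus the edited transcript)
# and generates new tokens for the gap; this tensor stands in for its output.
new_span = torch.tensor([77, 19, 62, 4])

edited = torch.cat([prefix, new_span, suffix])
print(edited.tolist())  # the new span may be longer or shorter than the old one
```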
What sets VoiceCraft 2.0 apart is its capability to handle challenging, realistic audio data. The model was evaluated on audiobooks, internet videos, and podcasts—data that includes diverse accents, varying speaking styles, different recording conditions, and background noise or music. It does not require clean, studio-quality recordings to produce usable results. For practitioners working with real-world audio, this robustness matters more than benchmark scores on sanitized datasets.
VoiceCraft 2.0 uses a Transformer decoder architecture with a token rearrangement procedure that combines causal masking and delayed stacking. This design enables the model to generate tokens within an existing sequence rather than only predicting the next token in a left-to-right fashion. The token infilling approach is what makes speech editing possible—you can replace a segment of speech with new content while preserving the original speaker's voice, prosody, and recording environment.
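The following is a minimal sketch of a delayed-stacking rearrangement in the spirit of the paper: codebook k is shifted right by k steps so that each decoding position can condition on lower codebooks of the same audio frame. The pad value and exact layout are assumptions for illustration.

```python
import torch

def delay_stack(codes: torch.Tensor, pad_id: int) -> torch.Tensor:
    """codes: (K, T) codec tokens; returns a (K, T + K - 1) delayed layout."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad_id, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # shift codebook k right by k steps
    return out

codes = torch.arange(12).reshape(4, 3)   # 4 codebooks, 3 frames (toy values)
stacked = delay_stack(codes, pad_id=-1)
print(stacked)
# Undo the delay after generation by slicing row k starting at offset k.
```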
The model is dense, meaning all 0.33B parameters are active during inference. There are no expert routing mechanisms or sparse activation patterns to manage. This simplifies deployment: you get predictable memory usage and consistent inference speed regardless of the input. For a dense model of this size, inference is straightforward to optimize with standard quantization and batching techniques.
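As an illustration of the "standard quantization" point, this generic PyTorch sketch applies stock dynamic int8 quantization to a stand-in dense module. It uses the standard torch.ao.quantization API for CPU inference, not any VoiceCraft-specific tooling.

```python
import torch
import torch.nn as nn

decoder_stub = nn.Sequential(            # stand-in for the 0.33B decoder
    nn.Linear(768, 3072), nn.GELU(),
    nn.Linear(3072, 768),
)

# Replace linear layers with dynamically quantized int8 equivalents.
quantized = torch.ao.quantization.quantize_dynamic(
    decoder_stub, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128, 768)             # (batch, sequence, hidden)
with torch.no_grad():
    y = quantized(x)
print(y.shape)                           # torch.Size([1, 128, 768])
```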
No official context length is specified for VoiceCraft 2.0, but based on the model's design for speech editing and TTS, it processes audio in segments. The reference audio (the voice you want to clone or the recording you want to edit) typically needs to be only a few seconds long. The model does not require long context windows because it operates on compressed token sequences from the neural codec, not on raw audio samples.
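A back-of-the-envelope token budget makes this concrete. The frame rate and codebook count below are assumptions in line with EnCodec-style codecs, not official specs.

```python
# Rough token budget for a codec-tokenized clip (assumed figures).
frame_rate_hz = 50      # codec frames per second of audio
num_codebooks = 4       # parallel token streams per frame
clip_seconds = 10

frames = frame_rate_hz * clip_seconds     # 500 frames
tokens = frames * num_codebooks           # 2000 codec tokens
print(f"{clip_seconds}s of audio -> {tokens} codec tokens")
# Even a 10-second clip stays far below typical Transformer context
# budgets, which is why long raw-audio windows are unnecessary.
```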
From the language model's perspective, all input is discrete tokens: you provide text transcripts for editing or target text for TTS, while the audio itself is encoded into tokens by a separate neural codec model before being passed to VoiceCraft 2.0. This modular design means you can swap the codec or preprocessing pipeline without retraining the language model itself.
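A sketch of what that modularity looks like in code; the interfaces below are hypothetical names, not the repository's API.

```python
from typing import Protocol
import torch

# Hypothetical interfaces: codec and language model are separate
# components, so the codec can be swapped without retraining the LM.
class AudioCodec(Protocol):
    def encode(self, wav: torch.Tensor) -> torch.Tensor: ...   # wav -> (K, T) tokens
    def decode(self, codes: torch.Tensor) -> torch.Tensor: ... # tokens -> wav

class CodecLM(Protocol):
    def infill(self, codes: torch.Tensor, text: str) -> torch.Tensor: ...

def edit_speech(codec: AudioCodec, lm: CodecLM,
                wav: torch.Tensor, edited_text: str) -> torch.Tensor:
    codes = codec.encode(wav)                   # audio -> discrete tokens
    new_codes = lm.infill(codes, edited_text)   # LM operates purely on tokens
    return codec.decode(new_codes)              # tokens -> audio
```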
VoiceCraft 2.0 has two primary capabilities: zero-shot speech editing and zero-shot text-to-speech. Both are "zero-shot" in the sense that the model can work with voices it has never encountered during training, using only a few seconds of reference audio.
Speech editing is the model's standout feature. You provide an original audio recording, its transcript, and an edited transcript with the changes you want. VoiceCraft 2.0 generates the edited audio with the modified portion seamlessly integrated into the original recording. The output is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by human listeners. Use cases include correcting misspoken words in podcast recordings, updating dialogue in video content without re-recording, and fixing errors in audiobook narration.
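One concrete piece of that workflow is locating which words changed between the two transcripts. The difflib-based sketch below is a simple stand-in for the repository's actual alignment step, shown only to illustrate the idea.

```python
import difflib

original = "the quick brown fox jumps over the lazy dog"
edited   = "the quick red fox leaps over the lazy dog"

# Word-level diff: find the spans where the transcripts disagree.
matcher = difflib.SequenceMatcher(None, original.split(), edited.split())
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op != "equal":
        print(op, original.split()[i1:i2], "->", edited.split()[j1:j2])
# replace ['brown'] -> ['red']
# replace ['jumps'] -> ['leaps']
# The changed words map (via forced alignment) to the audio span whose
# codec tokens get regenerated; everything else is kept verbatim.
```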
Zero-shot TTS allows you to clone an unseen voice from a few seconds of reference audio and generate new speech from arbitrary text. The model outperforms prior state-of-the-art models, including VALL-E and the commercial XTTS-v2, on realistic, challenging datasets. Unlike many TTS systems that degrade significantly with background noise or unusual speaking styles, VoiceCraft 2.0 maintains voice consistency and naturalness across diverse conditions.
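Conceptually, zero-shot TTS here works as prompt continuation over codec tokens. The sketch below uses hypothetical names (codec.encode, lm.generate) to show the shape of the pipeline; the repository's inference_tts.ipynb contains the real one.

```python
import torch

# Conceptual sketch of zero-shot TTS as prompt continuation. All names
# are illustrative, not the repository's API.
def zero_shot_tts(lm, codec, reference_wav: torch.Tensor,
                  reference_text: str, target_text: str) -> torch.Tensor:
    prompt_codes = codec.encode(reference_wav)   # a few seconds of audio suffice
    full_text = reference_text + " " + target_text
    # The model continues the prompt tokens, conditioned on both texts,
    # so the generated speech keeps the reference speaker's voice.
    new_codes = lm.generate(text=full_text, prompt=prompt_codes)
    return codec.decode(new_codes)               # back to a waveform
```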
The model is English-only based on available documentation. It is licensed under CC-BY-NC-SA-4.0, which permits non-commercial use, sharing, and adaptation with attribution. Commercial deployment requires separate licensing from the University of Texas at Austin.
VoiceCraft 2.0's 0.33B parameter count makes it feasible to run on consumer hardware, but the model's architecture and inference pipeline have specific requirements beyond just parameter count.
At full FP16 precision, VoiceCraft 2.0 requires approximately 0.7-1.0 GB of VRAM for the model weights alone. However, the total memory footprint during inference depends on your audio processing pipeline, batch size, and whether you include the neural codec model in the same process. Realistic VRAM usage for the full inference pipeline (codec encoding + language model inference + audio decoding) is 2-4 GB.
With 4-bit quantization (Q4_K_M), model weights drop to roughly 0.2-0.3 GB, and total pipeline memory usage falls to 1-2 GB. This makes the model runnable on GPUs with as little as 4 GB VRAM, including older or entry-level cards.
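The weight-only arithmetic behind those figures is straightforward:

```python
# Weight-memory arithmetic. Activations, KV cache, and the codec model
# add to these totals, which is why the pipeline figures run higher.
params = 0.33e9

fp16_gb = params * 2 / 1e9      # 2 bytes per parameter -> ~0.66 GB
q4_gb = params * 0.5 / 1e9      # 4 bits per parameter  -> ~0.17 GB

print(f"FP16 weights : ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{q4_gb:.2f} GB")
# Q4_K_M stores slightly more than 4 bits/weight once per-group scales
# are counted, which is why the 0.2-0.3 GB figure above is a bit higher.
```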
At 0.33B parameters, VoiceCraft 2.0 achieves fast inference on modern hardware. On an RTX 4090, expect 50-100 tokens per second for the language model component, translating to sub-second generation for typical speech editing segments. On an M4 Max, performance is comparable. On lower-end hardware with quantization, expect 20-40 tokens per second—still fast enough for interactive use.
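To sanity-check the sub-second claim, here is a rough latency estimate assuming an EnCodec-style ~50 frames/s codec and one codec frame generated per decoding step (both are assumptions, not published specs).

```python
# Rough latency estimate from the throughput figures above.
frame_rate_hz = 50           # assumed codec frames per second of audio
steps_per_second = 75        # mid-range of the 50-100 tokens/s figure

audio_seconds = 1.0          # a short edited segment
gen_seconds = audio_seconds * frame_rate_hz / steps_per_second
print(f"~{gen_seconds:.2f}s to generate {audio_seconds}s of audio")  # ~0.67s
```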
The official GitHub repository provides multiple paths for local deployment: Docker images, Gradio web UI, and standalone inference scripts. For quick testing, the Gradio Colab notebook can be adapted for local use. The repository includes inference_tts.ipynb and inference_speech_editing.ipynb Jupyter notebooks that walk through the full pipeline step by step.
For most users running on consumer hardware, Q4_K_M offers the best balance of quality and efficiency. The model is small enough that 4-bit quantization introduces minimal quality degradation for TTS and speech editing tasks. If you have 8+ GB VRAM and want maximum quality, run at FP16. There is no practical benefit to 8-bit quantization for this model size—you either save significant VRAM with 4-bit or run full precision.
VoiceCraft 2.0 occupies a specific niche: zero-shot speech editing and TTS for in-the-wild audio. The most direct comparison is with XTTS-v2, a popular commercial model. VoiceCraft 2.0 outperforms XTTS-v2 on realistic datasets with background noise, diverse accents, and varying recording quality. XTTS-v2 has the advantage of broader language support and a more polished deployment pipeline, but VoiceCraft 2.0 produces more natural results on challenging audio.
Compared to VALL-E, another zero-shot TTS model, VoiceCraft 2.0 achieves better naturalness scores in human evaluation. VALL-E was a research milestone, but VoiceCraft 2.0 improves upon it significantly, particularly for speech editing, which VALL-E does not support natively.
The tradeoff is that VoiceCraft 2.0 is research software. The deployment pipeline requires more setup than commercial alternatives, and the documentation assumes familiarity with neural audio codecs and Transformer inference. For practitioners who need a plug-and-play solution and can tolerate some quality loss on difficult audio, XTTS-v2 may be more practical. For those who need the best possible results on real-world audio and are comfortable with a research-grade codebase, VoiceCraft 2.0 is the better choice.