An expressive, zero-shot StyleTTS 2 fine-tune by ShoukanLabs intended as an improved base model for further TTS fine-tuning.
Vokan TTS is a fine-tuned StyleTTS 2 model from ShoukanLabs, designed specifically for zero-shot text-to-speech synthesis with an emphasis on expressiveness. Where many TTS models trade expressiveness for consistency and end up sounding flat, Vokan targets natural prosody and vocal variation across unseen speakers. It is released under the MIT license, making it freely available for commercial and research use.
ShoukanLabs positions Vokan as a base model for further fine-tuning, not a finished product. This is an important distinction. If you need a drop-in TTS solution, you can use Vokan as-is for zero-shot voice cloning. But its primary design goal is to serve as a foundation—a stronger starting point than the original StyleTTS 2 for practitioners who want to fine-tune on their own voice datasets.
The model uses a dense architecture with an undisclosed parameter count. While the exact figure is not public, the model weights occupy approximately 14.3 GB on disk, which gives a practical reference point for hardware planning. Vokan was trained on over six days' worth of audio spanning 672 speakers, combining the AniSpeech, VCTK, and LibriTTS-R datasets. Training consumed 300 hours on a single H100 plus 600 hours on a single RTX 3090.
Vokan competes in the zero-shot TTS space alongside models like XTTS-v2 and Bark. Its advantage is being a direct fine-tune of StyleTTS 2, which is known for fast inference and strong prosody control. The tradeoff is that it is English-only and lacks the multilingual support that some alternatives offer.
Vokan is built on StyleTTS 2, a model architecture that separates style encoding from text encoding. This design allows the model to adapt to new speakers without retraining—you provide a reference audio sample, and the model extracts a style vector that controls pitch, rhythm, and timbre. The text is processed through a separate encoder, and the two streams are combined during decoding.
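The two-stream idea can be sketched with toy numpy stand-ins. Everything here (the dimensions, the mean-pooling style encoder, the concatenation-based conditioning) is an illustrative placeholder, not the real StyleTTS 2 modules:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only, not the real StyleTTS 2 sizes.
STYLE_DIM, TEXT_DIM, HIDDEN = 128, 512, 256

def encode_style(reference_audio: np.ndarray) -> np.ndarray:
    """Stand-in style encoder: pools reference-audio features into one
    fixed-size style vector controlling pitch, rhythm, and timbre."""
    frames = reference_audio.reshape(-1, STYLE_DIM)  # fake framing
    return frames.mean(axis=0)                       # shape (STYLE_DIM,)

def encode_text(phoneme_ids: list[int]) -> np.ndarray:
    """Stand-in text encoder: embeds each phoneme independently."""
    table = rng.normal(size=(100, TEXT_DIM))         # fake embedding table
    return table[phoneme_ids]                        # shape (T, TEXT_DIM)

def decode(text_enc: np.ndarray, style: np.ndarray) -> np.ndarray:
    """The two streams meet here: the style vector conditions every
    text frame before projection to audio features."""
    w = rng.normal(size=(TEXT_DIM + STYLE_DIM, HIDDEN))
    conditioned = np.concatenate(
        [text_enc, np.tile(style, (len(text_enc), 1))], axis=1)
    return conditioned @ w                           # shape (T, HIDDEN)

# Zero-shot cloning: a new reference clip yields a new style vector,
# with no retraining of any encoder or decoder weights.
reference = rng.normal(size=4 * STYLE_DIM)           # fake reference clip
style_vec = encode_style(reference)
frames = decode(encode_text([3, 14, 15, 9, 2]), style_vec)
print(frames.shape)  # (5, 256)
```

The key property the sketch captures is that swapping `reference` changes only `style_vec`; the text encoder and decoder weights are untouched, which is what makes the adaptation zero-shot.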
The architecture is dense, meaning all parameters are active during inference. This contrasts with mixture-of-experts (MoE) models where only a subset of parameters activate per token. For TTS workloads, dense architectures tend to produce more consistent voice quality because every inference uses the full model capacity. The tradeoff is higher VRAM usage compared to an equivalently-sized MoE model.
Because the parameter count is undisclosed, you cannot directly compare Vokan to other models on parameter efficiency alone. The 14.3 GB model size suggests a parameter count in the range of 2-4 billion, but this is speculative. What matters for practitioners is the practical hardware footprint.
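The size-to-parameters reasoning is simple arithmetic. This sketch, with an assumed 5% allowance for non-weight content in the checkpoint, shows why FP32 storage of a 14.3 GB file lands in the 2-4 billion range, while FP16 storage would imply roughly double that:

```python
def params_from_size(size_gb: float, bytes_per_param: float,
                     overhead: float = 0.05) -> float:
    """Rough parameter count implied by an on-disk checkpoint size.

    `overhead` discounts non-weight content (configs, vocoder
    components); the 5% default is an assumption, not a known figure.
    """
    usable = size_gb * 1e9 * (1 - overhead)
    return usable / bytes_per_param

# Vokan's checkpoint is ~14.3 GB on disk.
fp32 = params_from_size(14.3, 4)  # if weights are stored in FP32
fp16 = params_from_size(14.3, 2)  # if weights are stored in FP16

print(f"FP32 storage implies ~{fp32 / 1e9:.1f}B params")  # ~3.4B
print(f"FP16 storage implies ~{fp16 / 1e9:.1f}B params")  # ~6.8B
```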
Vokan accepts text-only input and outputs audio. The context length is not specified, but StyleTTS 2 typically handles inputs up to several hundred characters well. For longer texts, you would chunk the input and concatenate the audio output. The model supports zero-shot voice cloning from a single reference audio sample, with no additional training required.
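A minimal chunker along those lines might look like this; the 300-character budget and the sentence-splitting regex are illustrative choices, not part of Vokan:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text at sentence boundaries so each chunk stays under
    max_chars, a length StyleTTS 2-style models handle well."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

long_text = "First sentence. " * 40
chunks = chunk_text(long_text, max_chars=120)
print(len(chunks), max(len(c) for c in chunks))  # 6 111
```

After synthesizing each chunk separately, the per-chunk waveforms can be joined (optionally with a short silence between them) to produce the full output.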
Vokan's primary capability is zero-shot text-to-speech with expressive prosody. Given a short reference audio clip (typically 3-10 seconds), it can synthesize new speech in that voice with natural pitch variation, pacing, and emotional inflection. This is not robotic concatenation—the model generates novel speech that matches the reference speaker's characteristics.
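A quick sanity check on reference-clip length needs only the standard-library `wave` module. The 3-10 second window follows the rule of thumb above; the file name and 24 kHz rate are hypothetical:

```python
import wave

def clip_duration_seconds(path: str) -> float:
    """Duration of a WAV file, for sanity-checking reference clips."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_usable_reference(path: str, lo: float = 3.0, hi: float = 10.0) -> bool:
    return lo <= clip_duration_seconds(path) <= hi

# Write a synthetic 5-second mono 16-bit clip to demonstrate the check.
with wave.open("ref.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24000)
    wav.writeframes(b"\x00\x00" * 24000 * 5)

print(is_usable_reference("ref.wav"))  # True
```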
Concrete use cases include:

- Audiobook and long-form narration, where flat, monotone delivery is unacceptable
- Character voices for games and animation, cloned from short reference clips
- Voiceover for video content in a consistent custom voice
- Prototyping voice interfaces before committing to studio recordings
The model is English-only. Training data included VCTK (British English), LibriTTS-R (American English), and AniSpeech (expressive English). Accents and dialects within English are well-represented, but Vokan will not handle other languages.
Vokan requires a GPU for practical inference speeds. CPU inference is technically possible but impractically slow for real-time use.
Minimum VRAM: 8 GB. This will run the model at FP16 (half precision) but leaves little headroom. You may need to reduce batch size or generate shorter audio outputs.
Recommended VRAM: 12-16 GB. An RTX 3060 12GB, RTX 4070, or similar will run Vokan comfortably at FP16 with room for processing. An RTX 4090 or 24 GB card gives you flexibility for larger batch processing or longer audio generation.
Quantization: Vokan supports standard quantization methods. For most users, Q4_K_M offers the best balance of quality and VRAM efficiency. This reduces the model footprint to approximately 5-7 GB, making it runnable on 8 GB cards with comfortable headroom. Q5_K_M preserves more quality but requires about 8-9 GB. Q8_0 is near-lossless but pushes VRAM requirements above 12 GB.
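The footprint figures above can be ballparked by scaling an FP16 baseline by each level's approximate bits per weight. Real quantized files carry metadata and keep some tensors at higher precision, so actual sizes run somewhat larger than this lower bound; the bits-per-weight values here are the commonly cited approximations for GGUF-style K-quants:

```python
# Approximate bits per weight for common quantization levels (assumption:
# typical GGUF-style K-quant averages, not Vokan-specific measurements).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.85}

def quantized_size_gb(fp16_size_gb: float, quant: str) -> float:
    """Scale a known FP16 footprint by the quant's bits per weight."""
    return fp16_size_gb * BITS_PER_WEIGHT[quant] / 16.0

for quant in ("Q8_0", "Q5_K_M", "Q4_K_M"):
    print(f"{quant}: ~{quantized_size_gb(14.3, quant):.1f} GB (lower bound)")
```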
Consumer hardware that works:

- RTX 3060 12GB or RTX 4070: comfortable FP16 inference
- RTX 4080 / RTX 4090 (16-24 GB): headroom for batch processing and longer outputs
- 8 GB cards: workable with quantization, but expect little headroom
Expected performance: At FP16 on an RTX 4090, expect real-time or faster generation (generating 10 seconds of audio in under 10 seconds). On a 12 GB card with Q4_K_M, expect 0.5-1x real-time speed. These numbers vary significantly based on audio length and system configuration.
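Real-time factor in this sense is just generated-audio seconds divided by wall-clock seconds, so RTF above 1.0 means faster than playback. A self-contained sketch, with `fake_tts` as a stand-in for an actual synthesis call (not the Vokan API):

```python
import time

def real_time_factor(generate_fn, text: str) -> float:
    """Measure synthesis speed relative to playback duration.

    `generate_fn` is any callable returning (samples, sample_rate).
    """
    start = time.perf_counter()
    samples, sample_rate = generate_fn(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return audio_seconds / elapsed

def fake_tts(text: str):
    time.sleep(0.05)               # pretend inference takes 50 ms
    return [0.0] * 24000, 24000    # pretend 1 s of audio at 24 kHz

rtf = real_time_factor(fake_tts, "Hello world")
print(f"RTF: {rtf:.1f}x real time")
```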
Quickest way to start: use the Hugging Face Space, or clone the repository directly from ShoukanLabs. The model files are available at ShoukanLabs/Vokan on Hugging Face. You will need PyTorch and the StyleTTS 2 inference code. There is no Ollama integration for TTS models at the time of writing; you run the Python inference script directly.
Vokan vs. XTTS-v2: XTTS-v2 supports multiple languages and has a larger community. Vokan produces more natural prosody on English text, particularly for expressive or emotional speech. XTTS-v2 is more robust for multilingual use but can sound flatter on English emotional content. Choose Vokan if English expressiveness is your priority and you do not need other languages.
Vokan vs. Bark: Bark offers built-in sound effects, music, and non-speech audio. Vokan has cleaner voice quality and faster inference. Bark is larger (requires more VRAM) and slower. For pure voice synthesis, Vokan is the better choice. Bark wins if you need ambient sounds or singing.
Vokan vs. original StyleTTS 2: Vokan is a direct improvement. Fine-tuning on diverse, expressive data gives it better zero-shot performance and more natural prosody. If you already use StyleTTS 2, Vokan is a straightforward upgrade that drops into the same inference pipeline. The main tradeoff is the sizable 14.3 GB checkpoint you will need to download.