An expressive, zero-shot StyleTTS 2 fine-tune by ShoukanLabs intended as an improved base model for further TTS fine-tuning.
Vokan TTS is a fine-tuned StyleTTS 2 model from ShoukanLabs, designed specifically for zero-shot text-to-speech synthesis with an emphasis on expressiveness. Where many TTS models trade expressiveness for consistency and end up sounding flat, Vokan targets natural prosody and vocal variation across unseen speakers. It is released under the MIT license, making it freely available for commercial and research use.
ShoukanLabs positions Vokan as a base model for further fine-tuning, not a finished product. This is an important distinction. If you need a drop-in TTS solution, you can use Vokan as-is for zero-shot voice cloning. But its primary design goal is to serve as a foundation—a stronger starting point than the original StyleTTS 2 for practitioners who want to fine-tune on their own voice datasets.
The model uses a dense architecture with an undisclosed parameter count. While the exact figure is not public, the model weights occupy approximately 14.3 GB on disk, which gives a practical reference point for hardware planning. Vokan was trained on over six days' worth of audio spanning 672 speakers, combining the AniSpeech, VCTK, and LibriTTS-R datasets. Training consumed 300 hours on a single H100 plus 600 hours on a single RTX 3090.
Vokan competes in the zero-shot TTS space alongside models like XTTS-v2 and Bark. Its advantage is being a direct fine-tune of StyleTTS 2, which is known for fast inference and strong prosody control. The tradeoff is that it is English-only and lacks the multilingual support that some alternatives offer.
Vokan is built on StyleTTS 2, a model architecture that separates style encoding from text encoding. This design allows the model to adapt to new speakers without retraining—you provide a reference audio sample, and the model extracts a style vector that controls pitch, rhythm, and timbre. The text is processed through a separate encoder, and the two streams are combined during decoding.
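The two-stream idea can be sketched with toy numpy stand-ins. Everything here (the dimensions, the mean-pooling style encoder, the concatenation-based conditioning) is an illustrative placeholder, not the real StyleTTS 2 modules:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only, not the real StyleTTS 2 sizes.
STYLE_DIM, TEXT_DIM, HIDDEN = 128, 512, 256

def encode_style(reference_audio: np.ndarray) -> np.ndarray:
    """Stand-in style encoder: pools reference-audio features into one
    fixed-size style vector controlling pitch, rhythm, and timbre."""
    frames = reference_audio.reshape(-1, STYLE_DIM)  # fake framing
    return frames.mean(axis=0)                       # shape (STYLE_DIM,)

def encode_text(phoneme_ids: list[int]) -> np.ndarray:
    """Stand-in text encoder: embeds each phoneme independently."""
    table = rng.normal(size=(100, TEXT_DIM))         # fake embedding table
    return table[phoneme_ids]                        # shape (T, TEXT_DIM)

def decode(text_enc: np.ndarray, style: np.ndarray) -> np.ndarray:
    """The two streams meet here: the style vector conditions every
    text frame before projection to audio features."""
    w = rng.normal(size=(TEXT_DIM + STYLE_DIM, HIDDEN))
    conditioned = np.concatenate(
        [text_enc, np.tile(style, (len(text_enc), 1))], axis=1)
    return conditioned @ w                           # shape (T, HIDDEN)

# Zero-shot cloning: a new reference clip yields a new style vector,
# with no retraining of any encoder or decoder weights.
reference = rng.normal(size=4 * STYLE_DIM)           # fake reference clip
style_vec = encode_style(reference)
frames = decode(encode_text([3, 14, 15, 9, 2]), style_vec)
print(frames.shape)  # (5, 256)
```

The key property the sketch captures is that swapping `reference` changes only `style_vec`; the text encoder and decoder weights are untouched, which is what makes the adaptation zero-shot.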
The architecture is dense, meaning all parameters are active during inference. This contrasts with mixture-of-experts (MoE) models where only a subset of parameters activate per token. For TTS workloads, dense architectures tend to produce more consistent voice quality because every inference uses the full model capacity. The tradeoff is higher VRAM usage compared to an equivalently-sized MoE model.
Because the parameter count is undisclosed, you cannot directly compare Vokan to other models on parameter efficiency alone. The 14.3 GB model size suggests a parameter count in the range of 2-4 billion, but this is speculative. What matters for practitioners is the practical hardware footprint.
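The size-to-parameters reasoning is simple arithmetic. This sketch, with an assumed 5% allowance for non-weight content in the checkpoint, shows why FP32 storage of a 14.3 GB file lands in the 2-4 billion range, while FP16 storage would imply roughly double that:

```python
def params_from_size(size_gb: float, bytes_per_param: float,
                     overhead: float = 0.05) -> float:
    """Rough parameter count implied by an on-disk checkpoint size.

    `overhead` discounts non-weight content (configs, vocoder
    components); the 5% default is an assumption, not a known figure.
    """
    usable = size_gb * 1e9 * (1 - overhead)
    return usable / bytes_per_param

# Vokan's checkpoint is ~14.3 GB on disk.
fp32 = params_from_size(14.3, 4)  # if weights are stored in FP32
fp16 = params_from_size(14.3, 2)  # if weights are stored in FP16

print(f"FP32 storage implies ~{fp32 / 1e9:.1f}B params")  # ~3.4B
print(f"FP16 storage implies ~{fp16 / 1e9:.1f}B params")  # ~6.8B
```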
Vokan accepts text-only input and outputs audio. The context length is not specified, but StyleTTS 2 typically handles inputs up to several hundred characters well. For longer texts, you would chunk the input and concatenate the audio output. The model supports zero-shot voice cloning from a single reference audio sample, with no additional training required.
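A minimal chunker along those lines might look like this; the 300-character budget and the sentence-splitting regex are illustrative choices, not part of Vokan:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Split text at sentence boundaries so each chunk stays under
    max_chars, a length StyleTTS 2-style models handle well."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks

long_text = "First sentence. " * 40
chunks = chunk_text(long_text, max_chars=120)
print(len(chunks), max(len(c) for c in chunks))  # 6 111
```

After synthesizing each chunk separately, the per-chunk waveforms can be joined (optionally with a short silence between them) to produce the full output.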
Vokan's primary capability is zero-shot text-to-speech with expressive prosody. Given a short reference audio clip (typically 3-10 seconds), it can synthesize new speech in that voice with natural pitch variation, pacing, and emotional inflection. This is not robotic concatenation—the model generates novel speech that matches the reference speaker's characteristics.
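A quick sanity check on reference-clip length needs only the standard-library `wave` module. The 3-10 second window follows the rule of thumb above; the file name and 24 kHz rate are hypothetical:

```python
import wave

def clip_duration_seconds(path: str) -> float:
    """Duration of a WAV file, for sanity-checking reference clips."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def is_usable_reference(path: str, lo: float = 3.0, hi: float = 10.0) -> bool:
    return lo <= clip_duration_seconds(path) <= hi

# Write a synthetic 5-second mono 16-bit clip to demonstrate the check.
with wave.open("ref.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(24000)
    wav.writeframes(b"\x00\x00" * 24000 * 5)

print(is_usable_reference("ref.wav"))  # True
```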
Concrete use cases include:

- Audiobook and long-form narration, where flat, monotone delivery is unacceptable
- Character voices for games and animation, cloned from short reference clips
- Voiceover for video content in a consistent custom voice
- Prototyping voice interfaces before committing to studio recordings
The model is English-only. Training data included VCTK (British English), LibriTTS-R (American English), and AniSpeech (expressive English). Accents and dialects within English are well-represented, but Vokan will not handle other languages.
Vokan requires a GPU for practical inference speeds. CPU inference is technically possible but impractically slow for real-time use.
Minimum VRAM: 8 GB. This will run the model at FP16 (half precision) but leaves little headroom. You may need to reduce batch size or generate shorter audio outputs.
Recommended VRAM: 12-16 GB. An RTX 3060 12GB, RTX 4070, or similar will run Vokan comfortably at FP16 with room for processing. An RTX 4090 or 24 GB card gives you flexibility for larger batch processing or longer audio generation.
Quantization: Vokan supports standard quantization methods. For most users, Q4_K_M offers the best balance of quality and VRAM efficiency. This reduces the model footprint to approximately 5-7 GB, making it runnable on 8 GB cards with comfortable headroom. Q5_K_M preserves more quality but requires about 8-9 GB. Q8_0 is near-lossless but pushes VRAM requirements above 12 GB.
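The footprint figures above can be ballparked by scaling an FP16 baseline by each level's approximate bits per weight. Real quantized files carry metadata and keep some tensors at higher precision, so actual sizes run somewhat larger than this lower bound; the bits-per-weight values here are the commonly cited approximations for GGUF-style K-quants:

```python
# Approximate bits per weight for common quantization levels (assumption:
# typical GGUF-style K-quant averages, not Vokan-specific measurements).
BITS_PER_WEIGHT = {"FP16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.5, "Q4_K_M": 4.85}

def quantized_size_gb(fp16_size_gb: float, quant: str) -> float:
    """Scale a known FP16 footprint by the quant's bits per weight."""
    return fp16_size_gb * BITS_PER_WEIGHT[quant] / 16.0

for quant in ("Q8_0", "Q5_K_M", "Q4_K_M"):
    print(f"{quant}: ~{quantized_size_gb(14.3, quant):.1f} GB (lower bound)")
```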
Consumer hardware that works:

- RTX 3060 12GB or RTX 4070: comfortable FP16 inference
- RTX 4080 / RTX 4090 (16-24 GB): headroom for batch processing and longer outputs
- 8 GB cards: workable with quantization, but expect little headroom
Expected performance: At FP16 on an RTX 4090, expect real-time or faster generation (generating 10 seconds of audio in under 10 seconds). On a 12 GB card with Q4_K_M, expect 0.5-1x real-time speed. These numbers vary significantly based on audio length and system configuration.
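Real-time factor in this sense is just generated-audio seconds divided by wall-clock seconds, so RTF above 1.0 means faster than playback. A self-contained sketch, with `fake_tts` as a stand-in for an actual synthesis call (not the Vokan API):

```python
import time

def real_time_factor(generate_fn, text: str) -> float:
    """Measure synthesis speed relative to playback duration.

    `generate_fn` is any callable returning (samples, sample_rate).
    """
    start = time.perf_counter()
    samples, sample_rate = generate_fn(text)
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return audio_seconds / elapsed

def fake_tts(text: str):
    time.sleep(0.05)               # pretend inference takes 50 ms
    return [0.0] * 24000, 24000    # pretend 1 s of audio at 24 kHz

rtf = real_time_factor(fake_tts, "Hello world")
print(f"RTF: {rtf:.1f}x real time")
```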
Quickest way to start: use the Hugging Face Space, or clone the repository directly from ShoukanLabs. The model files are available at ShoukanLabs/Vokan on Hugging Face. You will need PyTorch and the StyleTTS 2 inference code. There is no Ollama integration for TTS models at the time of writing; you run the Python inference script directly.
Vokan vs. XTTS-v2: XTTS-v2 supports multiple languages and has a larger community. Vokan produces more natural prosody on English text, particularly for expressive or emotional speech. XTTS-v2 is more robust for multilingual use but can sound flatter on English emotional content. Choose Vokan if English expressiveness is your priority and you do not need other languages.
Vokan vs. Bark: Bark offers built-in sound effects, music, and non-speech audio. Vokan has cleaner voice quality and faster inference. Bark is larger (requires more VRAM) and slower. For pure voice synthesis, Vokan is the better choice. Bark wins if you need ambient sounds or singing.
Vokan vs. original StyleTTS 2: Vokan is a direct improvement. Fine-tuning on diverse, expressive data gives it better zero-shot performance and more natural prosody. If you already use StyleTTS 2, Vokan is a straightforward upgrade that drops into the same inference pipeline. The main tradeoff is the sizable 14.3 GB checkpoint you will need to download.